Product Reviews Clustering

Author

Yuzhou Fu

1. Project Introduction

In the fast-moving consumer goods industry, customer feedback is crucial for maintaining product quality and brand reputation. With thousands of reviews submitted daily on platforms like Sephora, manually analyzing this unstructured text data is impossible.

The goal of this analysis is to apply unsupervised learning techniques to automatically uncover distinct themes within customer reviews, transforming raw text into actionable business insights.

The intended audience for this analysis is the product development or other related teams at a beauty company. By identifying specific clusters of customer feedback, the team can take targeted next steps: for instance, if a cluster reveals consistent complaints about “leaking bottles”, the team can redesign the packaging; if another cluster highlights “skin irritation”, the formula can be revisited for safety testing.

The dataset we will be using is Sephora Products and Skincare Reviews from Kaggle, which was collected via a Python scraper in March 2023 and contains:

  • Product Dataset: information about all beauty products (over 8,000) from the Sephora online store, including product and brand names, prices, ingredients, ratings, and other product features.

  • Customer Review Dataset: user reviews (about 1 million, covering over 2,000 products) of all products from the Skincare category, including reviewer profile attributes (skin tone, skin type, eye color, hair color) and helpfulness feedback from other users.

Given the limited computational resources available and the practical aims of this project, we adopted a two-stage data approach. The complete dataset was used for data cleaning and EDA to identify global trends. For the computationally intensive clustering algorithms, we used a stratified sample of approximately 10,000 reviews to keep model training and parameter tuning efficient.
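As a sketch of that second stage, stratified sampling can be done with pandas `groupby(...).sample(...)`. The toy frame, stratification column, and fraction below are illustrative assumptions, not the project's actual settings.

```python
import pandas as pd

# Toy stand-in for the full review table; the real pipeline samples
# ~10,000 reviews, and stratifying on rating is an assumption here.
reviews = pd.DataFrame({
    "product_id": ["P1"] * 6 + ["P2"] * 4,
    "rating": [5, 4, 3, 5, 2, 1, 5, 4, 3, 2],
})

# Sample a fixed fraction within each rating stratum so the subset
# roughly preserves the rating distribution of the full data.
subset = (
    reviews.groupby("rating", group_keys=False)
           .sample(frac=0.5, random_state=42)
)
print(len(subset))
```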

Code
# import relevant libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


from collections import defaultdict
from wordcloud import STOPWORDS
import string

import ast
import re

import operator
import unicodedata
from collections import Counter

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

from wordcloud import WordCloud
import math

plt.style.use("seaborn-v0_8-white")

2. Data Loading

Load product dataset.

Code
product_df = pd.read_csv("dataset/Sephora/product_info.csv")
product_df.head()
product_id product_name brand_id brand_name loves_count rating reviews size variation_type variation_value ... online_only out_of_stock sephora_exclusive highlights primary_category secondary_category tertiary_category child_count child_max_price child_min_price
0 P473671 Fragrance Discovery Set 6342 19-69 6320 3.6364 11.0 NaN NaN NaN ... 1 0 0 ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... Fragrance Value & Gift Sets Perfume Gift Sets 0 NaN NaN
1 P473668 La Habana Eau de Parfum 6342 19-69 3827 4.1538 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 85.0 30.0
2 P473662 Rainbow Bar Eau de Parfum 6342 19-69 3253 4.2500 16.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
3 P473660 Kasbah Eau de Parfum 6342 19-69 3018 4.4762 21.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
4 P473658 Purple Haze Eau de Parfum 6342 19-69 2691 3.2308 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL ... 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0

5 rows × 27 columns

2.1 Column Type Inconsistency (Review Dataset)

Load customer review dataset.

Code
review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_31388\3989447197.py:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)

Loading the first partial customer review dataset, we see that some columns have inconsistent data types. We need to check and correct this issue for each partial review dataset.

The problematic columns are author_id and the user profile columns. The primary problem is that some author_id values are stored as strings instead of integers; each newly encountered string author_id is replaced with an integer, starting from 1. User profile columns such as skin_tone and eye_color contain both NA values and strings; all NA values in these columns are replaced with ‘No_profile’. Likewise, NA values in review_text and review_title are replaced with ‘No_review’ and ‘No_review_title’, respectively.

Code
# check data type in each column
def column_type_check(df): # data frame
    print('Columns that have multiple data types: ')

    for column in df.columns:
        if len(df[column].apply(type).value_counts()) >= 2:
            print(' ', column)
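On a toy frame (made-up data) the check behaves as follows; the detection logic mirrors the function above, flagging a column when its values span two or more Python types:

```python
import pandas as pd

# Hypothetical toy frame: 'author_id' mixes int and str, 'rating' does not.
toy = pd.DataFrame({
    "author_id": [123, "orderGen51156", 456],
    "rating": [5, 4, 3],
})

# A column is flagged when its values span two or more Python types.
mixed = [c for c in toy.columns if toy[c].apply(type).nunique() >= 2]
print(mixed)  # ['author_id']
```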

2.1.1 First Partial Review Dataset

Code
review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)
column_type_check(review_df_1)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_31388\467310262.py:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  review_df_1 = pd.read_csv("dataset/Sephora/reviews_0-250.csv", index_col=0)
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
#author_id
mask = (review_df_1.loc[:,'author_id'] == 'dummyUser')
review_df_1.loc[mask,'author_id'] = 1

mask = review_df_1.iloc[:,0].apply(lambda x: isinstance(x, str)) 
a = review_df_1[mask]
mask_idx = a[a.loc[:,'author_id'].str.contains(r'order', regex=False)].index # for the str 'order...'

unique_id = review_df_1.loc[mask_idx, 'author_id'].unique()

# create id mapping for storing id generated 
author_id_mapping = {old: new_id for new_id, old in enumerate(unique_id, start= 2)}

review_df_1.loc[mask_idx, 'author_id'] = review_df_1.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_1['author_id'] = review_df_1['author_id'].astype(int)

#review_text
mask = review_df_1.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_1.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_1.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_1.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_1.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_1.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_1.loc[mask,'hair_color'] = 'No_profile'
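The same six NA replacements recur for every partial dataset; a hypothetical helper such as `fill_missing_text` below could consolidate them (a sketch, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Column-to-placeholder mapping mirroring the replacements above.
NA_FILLS = {
    "review_text": "No_review",
    "review_title": "No_review_title",
    "skin_tone": "No_profile",
    "eye_color": "No_profile",
    "skin_type": "No_profile",
    "hair_color": "No_profile",
}

def fill_missing_text(df, fills=NA_FILLS):
    """Replace float (NaN) entries in text columns with their placeholder."""
    df = df.copy()
    for col, placeholder in fills.items():
        if col in df.columns:
            mask = df[col].apply(lambda x: isinstance(x, float))
            df.loc[mask, col] = placeholder
    return df

toy = pd.DataFrame({"review_text": ["great", np.nan],
                    "skin_tone": [np.nan, "light"]})
print(fill_missing_text(toy))
```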

2.1.2 Second Partial Review Dataset

Code
review_df_2 = pd.read_csv("dataset/Sephora/reviews_250-500.csv", index_col=0)
column_type_check(review_df_2)
Columns that have multiple data types: 
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()
array(['orderGen1254820', 'orderGen1221842', 'orderGen1698648',
       'orderGen53499', 'orderGen51156', 'orderGen333757',
       'orderGen5563740'], dtype=object)
Code
list(author_id_mapping.keys())
['orderGen51156',
 'orderGen2124216',
 'orderGen703225',
 'orderGen5563740',
 'orderGen270100',
 'orderGen1221842',
 'orderGen1254820',
 'orderGen1253445',
 'orderGen1937304',
 'orderGen3046665',
 'orderGen1711826',
 'orderGen309293',
 'orderGen1698648',
 'orderGen39837',
 'orderGen899861']
Code
# author_id
print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()
      , '\n')

check_idx = np.isin(review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 17)}

author_id_mapping = author_id_mapping | author_id_mapping_conca

review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'] = (
    review_df_2.loc[~ review_df_2['author_id'].str.isnumeric(), 'author_id'].map(author_id_mapping).astype(int)
)

review_df_2['author_id'] = review_df_2['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861'] 

unique string id in this partial dataframe:  ['orderGen1254820' 'orderGen1221842' 'orderGen1698648' 'orderGen53499'
 'orderGen51156' 'orderGen333757' 'orderGen5563740'] 

author_id is in the author id mapping:  [ True  True  True False  True False  True]
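Note that the `|` dict merge used above requires Python 3.9+; on earlier versions `{**a, **b}` is equivalent. A minimal illustration with toy values:

```python
# Right-hand values win on key collisions; here the keys are disjoint,
# matching how newly encountered order-generated ids extend the mapping.
a = {"orderGen51156": 2, "orderGen2124216": 3}
b = {"orderGen53499": 17, "orderGen333757": 18}
merged = a | b
print(len(merged))  # 4
```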
Code
#review_text
mask = review_df_2.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_2.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_2.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_2.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_2.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_2.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_2.loc[mask,'hair_color'] = 'No_profile'

2.1.3 Third Partial Review Dataset

Code
review_df_3 = pd.read_csv("dataset/Sephora/reviews_500-750.csv", index_col=0)
column_type_check(review_df_3)
Columns that have multiple data types: 
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
# author_id
print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique()
      , '\n')

check_idx = np.isin(review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 19)}

author_id_mapping = author_id_mapping | author_id_mapping_conca

review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'] = (
    review_df_3.loc[~ review_df_3['author_id'].str.isnumeric(), 'author_id'].map(author_id_mapping).astype(int)
)

review_df_3['author_id'] = review_df_3['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757'] 

unique string id in this partial dataframe:  ['orderGen5563740' 'orderGen1474435' 'orderGen1698648'] 

author_id is in the author id mapping:  [ True False  True]
Code
#review_text
mask = review_df_3.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_3.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_3.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_3.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_3.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_3.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_3.loc[mask,'hair_color'] = 'No_profile'

2.1.4 Fourth Partial Review Dataset

Code
review_df_4 = pd.read_csv("dataset/Sephora/reviews_750-1250.csv", index_col=0)
column_type_check(review_df_4)
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_31388\136717371.py:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  review_df_4 = pd.read_csv("dataset/Sephora/reviews_750-1250.csv", index_col=0)
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
Code
# author_id
mask = review_df_4['author_id'].apply(lambda x: isinstance(x, str))
a = review_df_4.loc[mask, 'author_id']

print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', a[~ a.str.isnumeric()].unique()
      , '\n')

check_idx = np.isin(a[~a.str.isnumeric()].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

to_be_added = a[~a.str.isnumeric()].unique()[~check_idx]

author_id_mapping_conca = {old: new_id for new_id, old in enumerate(to_be_added, start= 20)}

author_id_mapping = author_id_mapping | author_id_mapping_conca


mask_idx = a[a.str.contains(r'order', regex=False)].index # for the str 'order...'

review_df_4.loc[mask_idx, 'author_id'] = review_df_4.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_4['author_id'] = review_df_4['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757', 'orderGen1474435'] 

unique string id in this partial dataframe:  ['orderGen1698648' 'orderGen1566769' 'orderGen1221842' 'orderGen2124216'
 'orderGen3046665' 'orderGen51156' 'orderGen1947347'] 

author_id is in the author id mapping:  [ True False  True  True  True  True False]
Code
#review_text
mask = review_df_4.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_4.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_4.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_4.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_4.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_4.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_4.loc[mask,'hair_color'] = 'No_profile'

2.1.5 Fifth Partial Review Dataset

Code
review_df_5 = pd.read_csv("dataset/Sephora/reviews_1250-end.csv", index_col=0)
column_type_check(review_df_5)
Columns that have multiple data types: 
  author_id
  review_text
  review_title
  skin_tone
  eye_color
  skin_type
  hair_color
C:\Users\fuyuz\AppData\Local\Temp\ipykernel_31388\1811167476.py:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  review_df_5 = pd.read_csv("dataset/Sephora/reviews_1250-end.csv", index_col=0)
Code
# author_id
mask = review_df_5['author_id'].apply(lambda x: isinstance(x, str))
a = review_df_5.loc[mask, 'author_id']

print('author_id that have been replaced with a new id: ', list(author_id_mapping.keys()), '\n')
print('unique string id in this partial dataframe: ', a[~ a.str.isnumeric()].unique()
      , '\n')

check_idx = np.isin(a[~a.str.isnumeric()].unique(), 
        list(author_id_mapping.keys()))

print('author_id is in the author id mapping: ', check_idx)

mask_idx = a[a.str.contains(r'order', regex=False)].index # for the str 'order...'

review_df_5.loc[mask_idx, 'author_id'] = review_df_5.loc[mask_idx, 'author_id'].map(author_id_mapping).astype(int)

review_df_5['author_id'] = review_df_5['author_id'].astype(int)
author_id that have been replaced with a new id:  ['orderGen51156', 'orderGen2124216', 'orderGen703225', 'orderGen5563740', 'orderGen270100', 'orderGen1221842', 'orderGen1254820', 'orderGen1253445', 'orderGen1937304', 'orderGen3046665', 'orderGen1711826', 'orderGen309293', 'orderGen1698648', 'orderGen39837', 'orderGen899861', 'orderGen53499', 'orderGen333757', 'orderGen1474435', 'orderGen1566769', 'orderGen1947347'] 

unique string id in this partial dataframe:  ['orderGen1947347' 'orderGen1698648' 'orderGen3046665'] 

author_id is in the author id mapping:  [ True  True  True]
Code
#review_text
mask = review_df_5.iloc[:,8].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'review_text'] = 'No_review'

#review_title
mask = review_df_5.iloc[:,9].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'review_title'] = 'No_review_title'

#skin_tone
mask = review_df_5.iloc[:,10].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'skin_tone'] = 'No_profile'

#eye_color
mask = review_df_5.iloc[:,11].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'eye_color'] = 'No_profile'

#skin_type
mask = review_df_5.iloc[:,12].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'skin_type'] = 'No_profile'

#hair_color
mask = review_df_5.iloc[:,13].apply(lambda x: isinstance(x, float))
review_df_5.loc[mask,'hair_color'] = 'No_profile'

2.1.6 Combine All Datasets

After all the corrections, we combine all our partial review datasets.

Additionally, we assign each distinct review text a unique ID.

Code
df_lis = [review_df_1, review_df_2, review_df_3, review_df_4, review_df_5]

review_df_all = pd.concat(df_lis, ignore_index=True)

# assign each review a unique ID
review_df_all['review_id'] = pd.factorize(review_df_all['review_text'])[0]

review_df_all.head()
author_id rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count submission_time review_text review_title skin_tone eye_color skin_type hair_color product_id product_name brand_name price_usd review_id
0 1741593524 5 1.0 1.0 2 0 2 2023-02-01 I use this with the Nudestix “Citrus Clean Bal... Taught me how to double cleanse! No_profile brown dry black P504322 Gentle Hydra-Gel Face Cleanser NUDESTIX 19.0 0
1 31423088263 1 0.0 NaN 0 0 0 2023-03-21 I bought this lip mask after reading the revie... Disappointed No_profile No_profile No_profile No_profile P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 1
2 5061282401 5 1.0 NaN 0 0 0 2023-03-21 My review title says it all! I get so excited ... New Favorite Routine light brown dry blonde P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 2
3 6083038851 5 1.0 NaN 0 0 0 2023-03-20 I’ve always loved this formula for a long time... Can't go wrong with any of them No_profile brown combination black P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 3
4 47056667835 5 1.0 NaN 0 0 0 2023-03-20 If you have dry cracked lips, this is a must h... A must have !!! light hazel combination No_profile P420652 Lip Sleeping Mask Intense Hydration with Vitam... LANEIGE 24.0 4
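`pd.factorize` assigns each distinct string the integer code of its first appearance, so identical review texts receive the same `review_id`. A minimal illustration with made-up texts:

```python
import pandas as pd

# Identical strings share a code; codes follow order of first appearance.
texts = pd.Series(["love it", "too oily", "love it", "No_review"])
codes, uniques = pd.factorize(texts)
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['love it', 'too oily', 'No_review']
```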

2.2 Review Duplicate

After assigning a unique text ID, we notice that a single product might have multiple repeated reviews. Below are some examples:

Code
mask = review_df_all['review_id'].value_counts()>1
duplicate_idx = review_df_all['review_id'].value_counts()[mask].index
duplicate_idx = duplicate_idx[1:] # drop the 'No_review' placeholder id

duplicate_df = review_df_all[review_df_all['review_id'].isin(duplicate_idx)]
print(duplicate_df.groupby(['product_id', 'review_id'])['review_id'].agg(['count']).sort_values(by = 'count', ascending=False).head())
                      count
product_id review_id       
P377368    782703        59
P384537    730674        14
P139000    292121        13
P122661    619906        13
P384537    730673         9
Code
review_df_all['review_id'].value_counts() > 1
review_id
8600       True
782703     True
158728     True
730674     True
694377     True
          ...  
969399    False
969400    False
969401    False
969402    False
969403    False
Name: count, Length: 969420, dtype: bool

What we want is for each product ID to correspond to a unique review ID, so let’s remove those duplicates. Our strategy is to keep the first occurrence.
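On a toy frame (made-up values), `drop_duplicates` with `keep='first'` retains the earliest row per (product_id, review_id) pair:

```python
import pandas as pd

toy = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "review_id": [7, 7, 7],
    "rating": [5, 4, 3],
})

# The second P1/7 row is dropped; the first survives, as does P2/7.
deduped = toy.drop_duplicates(subset=["product_id", "review_id"], keep="first")
print(deduped["rating"].tolist())  # [5, 3]
```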

Code
mask = review_df_all["review_id"].isin(duplicate_idx)
review_df_all_deduplicate = review_df_all.loc[mask].drop_duplicates(subset=['product_id', 'review_id'], keep='first')

Check again.

Code
print('de-duplicates data frame: ', '\n',
      review_df_all_deduplicate.groupby(['product_id', 'review_id'])['review_id'].agg(['count']).value_counts()
)
de-duplicates data frame:  
 count
1        241827
Name: count, dtype: int64

2.3 NA Value

Empty review entries should also be excluded.

Check for any empty text entries, which we labeled as ‘No_review’:

Code
print('number of review entered as \'No_review\': ', (review_df_all_deduplicate['review_text'] == 'No_review').sum())
number of review entered as 'No_review':  0

3. EDA

Now we can do the EDA to better understand our dataset.

3.1 Product Level

Code
product_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8494 entries, 0 to 8493
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          8494 non-null   object 
 1   product_name        8494 non-null   object 
 2   brand_id            8494 non-null   int64  
 3   brand_name          8494 non-null   object 
 4   loves_count         8494 non-null   int64  
 5   rating              8216 non-null   float64
 6   reviews             8216 non-null   float64
 7   size                6863 non-null   object 
 8   variation_type      7050 non-null   object 
 9   variation_value     6896 non-null   object 
 10  variation_desc      1250 non-null   object 
 11  ingredients         7549 non-null   object 
 12  price_usd           8494 non-null   float64
 13  value_price_usd     451 non-null    float64
 14  sale_price_usd      270 non-null    float64
 15  limited_edition     8494 non-null   int64  
 16  new                 8494 non-null   int64  
 17  online_only         8494 non-null   int64  
 18  out_of_stock        8494 non-null   int64  
 19  sephora_exclusive   8494 non-null   int64  
 20  highlights          6287 non-null   object 
 21  primary_category    8494 non-null   object 
 22  secondary_category  8486 non-null   object 
 23  tertiary_category   7504 non-null   object 
 24  child_count         8494 non-null   int64  
 25  child_max_price     2754 non-null   float64
 26  child_min_price     2754 non-null   float64
dtypes: float64(7), int64(8), object(12)
memory usage: 1.7+ MB

First we want to know:

How many brands and how many products are in this product dataset?

Code
print('Brand number: ', len(product_df['brand_name'].unique()))
print('Product number: ', len(product_df['product_name'].unique()))
Brand number:  304
Product number:  8415

3.1.1 rating

Now, let’s explore the average rating of the products. Since rating is continuous on [0, 5], we first discretize it into intervals.

Code
product_df['rating'].head()
0    3.6364
1    4.1538
2    4.2500
3    4.4762
4    3.2308
Name: rating, dtype: float64

Plot distribution.

Code
bins = [0, 1, 2, 3, 4, 5]
labels = ["(0,1]", "(1,2]", "(2,3]", "(3,4]", "(4,5]"]
product_df["rating_interval"] = pd.cut(product_df["rating"], bins=bins, right=True, labels=labels, include_lowest=False)

ax = sns.barplot(product_df["rating_interval"].value_counts().sort_index())
ax.bar_label(ax.containers[0], fontsize = 10)
ax.set_title('Distribution of Product Rating')
plt.grid(True)
plt.show()

Most of the products are rated at or above the midpoint. Let’s also draw a sunburst chart to explore the relation between rating and the primary, secondary, and tertiary product categories.

Code
df_plot = product_df[~ product_df['tertiary_category'].isna()] # the chart is not robust to na value

print('Number of tertiary_category NA values: ', product_df['tertiary_category'].isna().sum())

fig = px.sunburst(
    df_plot,
    path = ['primary_category', 'secondary_category', 'tertiary_category'],
    color = 'rating',
    maxdepth=-1,
    labels={'rating': 'Product Rating'}
)

fig.update_layout(
    width = 1200,
    height = 900,
    title={
        'text':"Product Category and Rating",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()
Number of tertiary_category NA values:  990
(Interactive sunburst chart rendered by plotly; not shown in this static export.)

Many sub-categories under the Fragrance section show deep colors, indicating significantly lower ratings than the bright orange/yellow segments elsewhere. Users appear more easily dissatisfied with Sephora’s fragrance offerings.

Skincare (especially “Moisturizers” and “Face Serums”) is dominated by bright yellow and light orange hues, suggesting consistently high customer satisfaction. Makeup shows darker orange and reddish tones: while the makeup category has a massive product volume, it is more likely to receive lower ratings, whereas skincare serves as the reliable, high-quality backbone of the catalog.

After exploring the relation between rating and product categories, how about the product price and product categories?

3.1.2 price

Code
fig = px.sunburst(
    df_plot,
    path = ['primary_category', 'secondary_category', 'tertiary_category'],
    color = 'price_usd',
    maxdepth=-1,
    labels={'price_usd': 'Product Price'}
)

fig.update_layout(
    width = 1200,
    height = 900,
    title={
        'text':"Product Category and Price",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()
(Interactive sunburst chart rendered by plotly; not shown in this static export.)

Notice the very thin but distinct segment labeled “High Tech Tools” (and nearby “Anti-Aging”) that appears bright yellow: while the majority of the catalog is affordable, this sub-category represents the premium price ceiling of Sephora’s inventory, standing out sharply against the rest of the dark purple chart.

The massive makeup section (eye, face, lip, …) is almost entirely deep purple, indicating a consistently low price point compared to other categories. Unlike skincare or fragrance, which show more price variation, makeup has a low cost barrier to entry and is an accessible category.

By contrast, the overall fragrance section is colored in a lighter tone, indicating that the fragrance products generally sit in a “mid-to-high” price tier.

How about we summarise all the numerical variables in the product dataset?

3.1.3 All Numerical Features

Code
df_plot = product_df[~product_df['reviews'].isna()]

fig = px.parallel_coordinates(
    df_plot.iloc[:,[4, 5, 6, 12]], 
    color="price_usd", 
    color_continuous_scale=px.colors.diverging.Spectral, 
    labels={'price_usd': 'Product Price'},
)

fig.update_layout(
    title={
        'text': "Product Price against Product Numeric Features",
        'x': 0.5,
        'y': 0.99,
        'xanchor': 'center'
    },
    margin=dict(t=80)
)

fig.show()
(Interactive parallel-coordinates chart rendered by plotly; not shown in this static export.)

As the price peaks at the far right, the line dips to near zero on both the reviews and loves_count axes (blue line), revealing a clear trade-off between price and engagement. Ultra-premium products (likely in the “High Tech Tools” category) generate almost no community buzz compared to the rest of the catalog, likely due to low sales volume.

We see that the red and purple lines (low-priced products) dominate the top peaks of the loves_count axis (reaching 1.4M+) and the reviews axis (reaching 20k+). Customers favor cheaper products; virality is exclusive to the lower price tier.

How about the non-numeric features?

3.1.4 highlights

The highlights field of a product is a list of tags that showcase the product’s attributes (e.g. [‘Vegan’, ‘Matte Finish’]).

We extract the top 20 most common keywords across all product highlights, then compare their rating, loves_count, and reviews.

This involves creating unigrams, so we will write a function for generating n-grams. Since n-gram generation requires cleaned text, we also perform the necessary text cleaning on the highlights column.

Code
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' and token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
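A stand-alone version of the generator behaves as below; the tiny stop-word set here is merely a placeholder for wordcloud’s STOPWORDS:

```python
# Placeholder stop-word set standing in for wordcloud's STOPWORDS.
TINY_STOPWORDS = {"the", "and", "a"}

def generate_ngrams_demo(text, n_gram=1):
    # Lowercase, split on spaces, and drop empties and stop words.
    tokens = [t for t in text.lower().split(" ")
              if t != "" and t not in TINY_STOPWORDS]
    # Zip shifted token lists together to form n-grams.
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ng) for ng in ngrams]

print(generate_ngrams_demo("Vegan and Matte Finish"))           # unigrams
print(generate_ngrams_demo("Vegan and Matte Finish", n_gram=2)) # bigrams
```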
Code
product_df_1 = product_df.copy()
df_highlights = product_df_1[~ product_df_1['highlights'].isna()].copy() # .copy() avoids SettingWithCopyWarning

df_highlights['highlights_list'] = df_highlights['highlights'].apply(ast.literal_eval)
df_highlights['highlights_list'] = df_highlights['highlights_list'].apply(lambda x: ' '.join(x))

df_highlights['highlights_list'] = df_highlights['highlights_list'].str.replace(r'(?<!\w)-(?!\w)', ' ', regex=True)
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.replace(r'[^\w\s-]', ' ', regex=True)
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.strip()
df_highlights['highlights_list'] = df_highlights['highlights_list'].str.split()
df_highlights['highlights_list'] = df_highlights['highlights_list'].apply(lambda x: ' '.join(x))

df_highlights['unigram'] = df_highlights['highlights_list'].apply(lambda x: generate_ngrams(x, n_gram=1))

unigrams = defaultdict(int)

for row in df_highlights['unigram']:
    for word in row:
        unigrams[word] += 1

df_unigram = pd.DataFrame(sorted(unigrams.items(), key=lambda x: x[1])[::-1])
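The `ast.literal_eval` step above parses the stringified lists stored in `highlights` back into Python lists; a minimal example:

```python
import ast

# highlights are stored as strings like "['Vegan', 'Matte Finish']".
raw = "['Vegan', 'Matte Finish']"
tags = ast.literal_eval(raw)
print(tags)  # ['Vegan', 'Matte Finish']
```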

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Plot the corresponding boxplots.

Note: for reviews and loves_count, we apply the \(x \rightarrow \log(1+x)\) transformation; otherwise the plots would be too skewed.
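To see why this transform helps, note that log1p compresses large counts far more than small ones while remaining finite at zero (the values below are purely illustrative):

```python
import numpy as np

counts = np.array([0, 9, 99, 999, 9999])   # heavily right-skewed raw counts
transformed = np.log1p(counts)             # x -> log(1 + x), finite at x = 0
print(transformed.round(2))                # spans ~0 to ~9 instead of 0 to 9999
```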

Code
rows = list(df_unigram.loc[df_unigram.iloc[:, 1] >= 900, 0]) # Top 20
col_names = ["rating", "reviews", "loves_count"]

for col in col_names:
    plot_data = []
    for word in rows:
        sub = df_highlights[df_highlights["highlights_list"]
                            .str.contains(word, case=False, regex=False)][col]
        vals = np.log1p(sub) if col != "rating" else sub
        plot_data.append(pd.DataFrame({
            "word": word,
            col: vals
        }))
    plot_df = pd.concat(plot_data, ignore_index=True)

    plt.figure(figsize=(12, 6))
    sns.boxplot(data=plot_df, x="word", y=col, color="skyblue")
    plt.xticks(rotation=45, ha="right")
    plt.xlabel("keyword")
    plt.ylabel(col if col == "rating" else f"log (1 + {col})")
    plt.title(f"{col} by Top-20 highlights keyword")
    plt.tight_layout()
    plt.show()

Based on the three plots, there is not much variation across the keywords; the highlight keyword is not the sole factor driving these numerical metrics.

3.2 Review Level

First, how many customers posted at least 1 review?

Code
print('Unique customers: ', len(review_df_all_deduplicate['author_id'].unique()))
Unique customers:  96666

Next, how are the posted reviews distributed over time?

3.2.1 Time Range

Code
review_df_all_deduplicate['submission_year'] = pd.to_datetime(review_df_all_deduplicate['submission_time']).dt.year
review_df_all_deduplicate['submission_month'] = pd.to_datetime(review_df_all_deduplicate['submission_time']).dt.month

plot_data = review_df_all_deduplicate.loc[:, ['submission_year', 'submission_month']]
plot_data = plot_data.groupby(["submission_year", "submission_month"]).size().reset_index(name="count")

fig = px.sunburst(
    plot_data,
    path=["submission_year", "submission_month"],
    values="count",
    color="count",
    color_continuous_scale="YlOrRd",
    labels={"count": "Reviews Count by Month"},
)

fig.update_layout(
    width = 800,
    height = 900,
    title={
        'text':"Reviews Count",
        'x': 0.47,
        'y': 0.95,
        'xanchor': 'center'
    },
    margin=dict(t=70)
)

fig.show()
(Interactive sunburst chart of reviews count by submission year and month; output not rendered here.)

Note: ignore the color differences in the inner ring of the sunburst chart.

The reviews span from 2008 to 2023. Among these years, 2020, 2021, and 2022 account for a large proportion of the reviews. In addition, reviews peak in May, June, and August during these years.

Since we are dealing with unstructured text data, N-gram analysis is essential to capture recurring phrase patterns and understand the linguistic structure of the reviews.

3.2.2 N-Gram Analysis

Create a dataframe containing the n-grams we need.

Code
df_gram = review_df_all_deduplicate.copy()
df_gram = df_gram[['author_id','review_id','review_text']]

df_gram['unigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=1))
df_gram['bigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=2))
df_gram['trigram'] = review_df_all_deduplicate['review_text'].apply(lambda x: generate_ngrams(x, n_gram=3))

We will be exploring unigram, bigram, and trigram.

Unigram

The most common unigrams are mostly stop words and uncleaned tokens (with punctuation attached), which carry little information. This indicates we need further cleaning.

Code
unigrams = defaultdict(int)

for row in df_gram['unigram']:
    for word in row:
        unigrams[word] += 1

df_unigram = pd.DataFrame(sorted(unigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)
N = 100
sns.barplot(y = df_unigram[0][:N], x=df_unigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common unigrams in customer reviews')
plt.show()

Bigram

Bigrams reveal more information than the unigrams: the most common bigrams are about skincare products, suggesting that customers who bought skincare products are more likely to leave a comment. More detailed information about customers' skin also appears; for instance, 'dry skin' and 'sensitive skin' are mentioned very frequently.

But most bigrams still contain stop words and uncleaned tokens (with punctuation attached), so we need further cleaning.

Code
bigrams = defaultdict(int)

for row in df_gram['bigram']:
    for word in row:
        bigrams[word] += 1

df_bigram = pd.DataFrame(sorted(bigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)

sns.barplot(y = df_bigram[0][:N], x=df_bigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common bigrams in customer reviews')
plt.show()

Trigram

The most common trigrams are also about skin; the top trigram shows that customers care about the long-lasting effects of the products.

The same stop-word and uncleaned-token issues appear in this plot, so we need further cleaning.

Code
trigrams = defaultdict(int)

for row in df_gram['trigram']:
    for word in row:
        trigrams[word] += 1

df_trigram = pd.DataFrame(sorted(trigrams.items(), key=lambda x: x[1])[::-1])

fig, ax = plt.subplots(figsize=(18, 50), dpi=100)

sns.barplot(y = df_trigram[0][:N], x=df_trigram[1][:N])

ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Top 100 most common trigrams in customer reviews')
plt.show()

This concludes the EDA. As the N-gram analysis above indicated, we now proceed to clean the text before word embedding.

4. Text Cleaning

4.1 Language Detection

This dataset contains product reviews in multiple languages, for example Spanish, French, and Chinese, so our first task is to keep English-only reviews.

We will use two language detection tools: langdetect and langid. The first performs the detection, and the second validates the detection results. Detailed scripts are omitted for brevity; interested readers can find them here.

Here we directly import the processed dataset.

Code
review_df_all_deduplicate = pd.read_csv("dataset/review_df_all_deduplicate_english.csv", index_col = 0)

4.2 Lowercase Letter

Code
review_df_all_deduplicate['review_text'] = review_df_all_deduplicate['review_text'].str.lower()

4.3 Standardize Unicode Character

After removing the non-English reviews, some Unicode variants of standard punctuation marks remain; we clean them by mapping each to its standard ASCII punctuation character.

Code
def clean_unicode(text):
    text = unicodedata.normalize('NFKC', text)

    replace_map = {
        "’": "'", "‘": "'", "´": "'", "`": "'",     
        "“": '"', "”": '"',                         
        "–": "-", "—": "-", "-": "-", "−": "-",     

        "。": ".", "、": ",", "・": " ", 
    }

    for old, new in replace_map.items():
        text = text.replace(old, new)  # re-assign: str.replace returns a new string

    return text

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text'].map(clean_unicode)
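The normalization-plus-mapping idea can be verified in isolation. The sketch below is a reduced version of `clean_unicode` (curly quotes only, a hypothetical subset of the full map above):

```python
import unicodedata

def clean_unicode_demo(text):
    # reduced version of the cleaning above: normalize, then map curly quotes
    text = unicodedata.normalize('NFKC', text)
    for old, new in {"’": "'", "‘": "'", "“": '"', "”": '"'}.items():
        text = text.replace(old, new)   # re-assign: str.replace returns a new string
    return text

print(clean_unicode_demo("don’t “overdo” it"))  # -> don't "overdo" it
```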

4.4 Non-ASCII Character

This step is similar to the Unicode cleaning, but here we remove non-ASCII characters that carry no meaning.

Non-ASCII characters include non-English letters such as 'å', emojis such as '🙂', and undisplayable characters. However, we retain only the emojis, since the embedding model used later can recognize emojis and map them to meaningful numerical vectors.

Code
# English letters + number + ASCII punctuation
ascii_allowed = r"[A-Za-z0-9\s\.,!?;:'\"()\-\[\]{}]"

emoji_pattern = (
    r"[\U0001F1E0-\U0001F1FF"  # Flags
    r"\U0001F300-\U0001F5FF"   # Symbols & pictographs
    r"\U0001F600-\U0001F64F"   # Emoticons
    r"\U0001F680-\U0001F6FF"   # Transport & map symbols
    r"\U0001F700-\U0001F77F"   # Alchemical symbols
    r"\U0001F780-\U0001F7FF"   # Geometric symbols
    r"\U0001F800-\U0001F8FF"   # Supplemental arrows
    r"\U0001F900-\U0001F9FF"   # Supplemental symbols & pictographs
    r"\U0001FA00-\U0001FA6F"   # Chess, symbols
    r"\U0001FA70-\U0001FAFF"   # Emoji components
    r"\U00002702-\U000027B0"   # Dingbats
    r"\U000024C2-\U0001F251"   # Enclosed characters
    r"]"
)

allowed = f"(?:{ascii_allowed}|{emoji_pattern})"

def clean_text_keep_emoji(text):
    return "".join(char if re.match(allowed, char) else "" for char in text)

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].apply(lambda x: clean_text_keep_emoji(x))
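The per-character whitelist can be illustrated in isolation. This is a reduced sketch covering only the emoticon block (U+1F600–U+1F64F), not the full set of ranges above:

```python
import re

ascii_allowed = r"[A-Za-z0-9\s\.,!?;:'\"()\-\[\]{}]"
emoticons = r"[\U0001F600-\U0001F64F]"          # reduced: emoticon block only
allowed = f"(?:{ascii_allowed}|{emoticons})"

def keep_ascii_and_emoji(text):
    # Drop any character that is neither allowed ASCII nor an emoticon.
    return "".join(c for c in text if re.match(allowed, c))

print(keep_ascii_and_emoji("très good 🙂"))  # -> "trs good 🙂"
```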
Code
punc_lis_remain = []


for text in review_df_all_deduplicate['review_text_cleaned']:
    punc_lis_remain.extend(re.findall(r"[^\w\s]", text))

punc_lis_remain = sorted(set(punc_lis_remain), key=lambda ch: ord(ch))
punc_lis_remain = [p for p in punc_lis_remain if not p.isascii()]

print('some non-ASCII characters remained: ', punc_lis_remain)

punc_lis_remain = np.array(punc_lis_remain)[[0,1, 2,3, 5, 6, 7, 8, 9, 10, 13, 16, 17, 18, 20, 21, 22, 23, 24, 25, 30, 31, 32, 33]]

pat = "[" + re.escape("".join(punc_lis_remain)) + "]"
review_df_all_deduplicate["review_text_cleaned"] = review_df_all_deduplicate["review_text_cleaned"].str.replace(pat, "", regex=True)

print('remaining cleaned')
some non-ASCII characters remained:  ['◡', '◼', '☀', '☆', '☝', '☹', '☺', '♀', '♡', '♥', '♻', '⚡', '✅', '✔', '✨', '❌', '❣', '❤', '⠀', '⭐', '、', '。', '・', '\uf0a7', '️', '', '🌺', '🍅', '🍞', '🍯', '🏻', '🏼', '🏽', '🏾', '🐐', '👌', '👍', '👎', '👏', '💔', '💕', '💗', '💙', '💚', '💜', '💯', '🔥', '😁', '😂', '😃', '😅', '😆', '😉', '😊', '😍', '😏', '😔', '😖', '😞', '😢', '😫', '😬', '😭', '🙁', '🙂', '🙃', '🙌', '🙏', '🤍', '🤓', '🤦', '🤩', '🤯', '🤷', '🥰', '🥶', '🥹', '🥺', '🧼', '🪄', '🫶']
remaining cleaned

4.5 Abbreviation

Two types of abbreviations:

  1. Contractions, e.g.: i’ve, don’t, …

  2. Abbreviations:

    • slang/Internet slang, e.g.: idk, lol, omg, …

    • domain-specific terms, e.g.: aha, bha, …

For type 1:

We detect zero remaining contractions.

Code
df_gram_abbreviation = review_df_all_deduplicate.copy()
df_gram_abbreviation = pd.DataFrame(df_gram_abbreviation['review_text_cleaned'])
df_gram_abbreviation['unigram'] = df_gram_abbreviation['review_text_cleaned'].apply(lambda x: generate_ngrams(x, n_gram=1))

all_tokens = (tok for lst in df_gram_abbreviation["unigram"] for tok in lst)

pattern = re.compile(r"[']")
abbr_tokens = [t for t in all_tokens if pattern.search(t)]

print('Number of tokens that contain \': ', len(abbr_tokens))
Number of tokens that contain ':  0

For type 2:

Code
all_tokens = (tok for lst in df_gram_abbreviation["unigram"] for tok in lst)
short_tokens = [t for t in all_tokens if len(t) < 4]
short_freq = Counter(short_tokens)

short_freq.most_common(300) # top 300
[('use', 75711),
 ('ive', 46001),
 ('im', 42936),
 ('dry', 42025),
 ('it.', 35990),
 ('one', 33303),
 ('try', 21790),
 ('see', 20136),
 ('got', 17914),
 ('day', 16768),
 ('oil', 16691),
 ('bit', 16180),
 ('now', 15240),
 ('way', 14232),
 ('say', 12401),
 ('me.', 11905),
 ('lot', 11704),
 ('put', 11564),
 ('it!', 11331),
 ('-', 11190),
 ('buy', 10941),
 ('go', 10743),
 ('new', 10566),
 ('it,', 9881),
 ('two', 7908),
 ('eye', 7736),
 ('2', 7628),
 ('far', 6437),
 ('3', 6402),
 ('ill', 6132),
 ('job', 6073),
 ('bad', 5985),
 ('id', 5367),
 ('red', 5197),
 ('.', 4840),
 ('big', 4740),
 ('top', 4654),
 ('may', 4576),
 ('gel', 4504),
 ('saw', 4478),
 ('spf', 4344),
 ('yet', 4158),
 ('5', 4153),
 ('fan', 4017),
 ('add', 3913),
 ('up.', 3802),
 ('4', 3691),
 ('let', 3610),
 ('end', 3559),
 ('jar', 3540),
 ('rub', 3313),
 ('due', 3300),
 ('me,', 3281),
 ('!', 2964),
 (',', 2944),
 ('sun', 2903),
 ('rid', 2872),
 ('on.', 2811),
 ('(i', 2806),
 ('la', 2538),
 ('run', 2443),
 ('c', 2347),
 ('old', 2267),
 ('10', 2231),
 ('1', 2143),
 ('100', 2069),
 ('ago', 1974),
 (':)', 1940),
 ('mix', 1822),
 ('is.', 1785),
 ('me!', 1730),
 ('do.', 1694),
 ('mer', 1611),
 ('bed', 1597),
 ('non', 1520),
 ('in.', 1513),
 ('on,', 1491),
 ('up,', 1480),
 ('set', 1434),
 ('30', 1424),
 ('box', 1407),
 ('pad', 1347),
 ('6', 1321),
 ('wow', 1269),
 ('ok', 1243),
 ('to.', 1226),
 ('bc', 1211),
 ('pay', 1182),
 ('pat', 1159),
 ('jet', 1153),
 ('sit', 1152),
 ('wet', 1146),
 ('dr.', 1120),
 ('per', 1118),
 ('2-3', 1115),
 (':(', 1067),
 ('def', 1059),
 ('7', 1051),
 ('hot', 1020),
 ('lag', 1003),
 ('ran', 1002),
 ('so,', 992),
 ('t', 975),
 ('ton', 951),
 ('oh', 945),
 ('yes', 933),
 ('tag', 919),
 ('15', 890),
 ('mom', 889),
 ('is,', 883),
 ('de', 843),
 ('ptr', 821),
 ('sad', 757),
 ('20', 750),
 ('lip', 707),
 ('bha', 700),
 ('omg', 698),
 ('up!', 693),
 ('in,', 680),
 ('lol', 680),
 ('8', 675),
 ('aha', 672),
 ('age', 652),
 ('(', 650),
 ('pea', 649),
 ('!!', 644),
 ('oz', 629),
 ('12', 629),
 ('pm', 628),
 ('kit', 623),
 ('50', 609),
 ('u', 599),
 ('bb', 597),
 ('40', 588),
 ('go.', 575),
 ('--', 574),
 ('tub', 571),
 ('low', 570),
 ('spa', 555),
 ('fix', 550),
 ('us', 547),
 ('ok.', 545),
 ('do,', 532),
 ('cut', 531),
 ('air', 521),
 ('...', 511),
 ('be.', 510),
 ('hit', 507),
 ('w', 504),
 ('dab', 501),
 ('3-4', 491),
 ('fun', 481),
 ('soo', 479),
 ('so.', 474),
 ('odd', 466),
 ('1-2', 457),
 ('key', 452),
 ('ha', 444),
 ('30s', 440),
 ('2nd', 438),
 ('sat', 436),
 ('ask', 434),
 ('cap', 433),
 ('tea', 432),
 ('3rd', 429),
 ('tan', 427),
 ('mid', 422),
 ('is!', 419),
 ('(it', 416),
 ('boy', 414),
 ('cc', 408),
 ('tad', 401),
 ('fit', 399),
 ('bag', 389),
 ('(im', 389),
 ('con', 386),
 ('lid', 379),
 ('of.', 375),
 ('2x', 371),
 ('dr', 368),
 ('ok,', 367),
 ('..', 367),
 ('45', 360),
 (':', 357),
 ('win', 352),
 ('oat', 348),
 ('1st', 346),
 ('via', 345),
 ('(or', 336),
 ('on!', 336),
 ('to,', 335),
 ('idk', 333),
 ('dew', 332),
 ('(a', 331),
 ('ur', 330),
 ('!!!', 329),
 ('no.', 325),
 ('(my', 324),
 ('god', 319),
 ('(in', 309),
 ('(no', 295),
 ('tho', 293),
 ('pop', 289),
 ('fav', 284),
 ('ole', 281),
 ('pre', 280),
 ('(as', 277),
 ('ceo', 270),
 ('hg', 266),
 ('ren', 265),
 ('aid', 261),
 ('vs', 257),
 ('(if', 257),
 ('yo', 256),
 ('n', 254),
 ('etc', 252),
 ('20s', 251),
 ('60', 248),
 ('ten', 247),
 ('24', 246),
 ('vib', 239),
 ('jaw', 237),
 ('bay', 237),
 ('am.', 235),
 ('to!', 233),
 ('min', 233),
 ('eat', 232),
 ('man', 230),
 ('1.', 224),
 ('six', 223),
 ('14', 223),
 ('bye', 220),
 ('e', 219),
 ('do!', 219),
 ('4-5', 218),
 ('die', 217),
 ('90', 216),
 ('34', 212),
 ('tip', 212),
 ('25', 209),
 ('ph', 208),
 ('3x', 206),
 ('35', 200),
 ('pot', 198),
 (')', 197),
 ('lil', 196),
 ('re', 195),
 ('s', 195),
 ('soy', 194),
 ('mad', 191),
 ('it?', 191),
 ('go!', 189),
 ('2.', 187),
 ('row', 186),
 ('95', 186),
 ('no,', 183),
 ('pod', 182),
 ('o', 180),
 ('dip', 178),
 ('itd', 178),
 ('9', 177),
 ('met', 177),
 ('4th', 175),
 ('ml', 175),
 ('zit', 174),
 ('300', 174),
 ('cuz', 172),
 ('vit', 170),
 ('13', 169),
 ('0', 169),
 ('c,', 168),
 ('bar', 167),
 ('(at', 167),
 ('(so', 167),
 ('c.', 167),
 ('sk', 166),
 ('inc', 166),
 ('son', 165),
 ('40s', 164),
 ('pro', 159),
 ('raw', 156),
 ('pha', 156),
 ('pm.', 154),
 ('ii', 153),
 ('meh', 151),
 ('it)', 151),
 ('tap', 151),
 ('duo', 150),
 ('fab', 149),
 ('uv', 146),
 ('30.', 145),
 ('ups', 145),
 ('oz.', 144),
 ('36', 142),
 ('act', 141),
 ('3.', 141),
 ('tug', 140),
 ('28', 140),
 ('vox', 140),
 ('2)', 139)]

We standardized abbreviations by examining the top 300 most frequent tokens with a length of less than 4 characters. While these thresholds (Top-300, Length < 4) are heuristic, this approach effectively captures the majority of high-frequency cases without requiring exhaustive manual review.

Create correction mapping.

Code
correct_map = {
    r"\bive\b": "i have",
    r"\bim\b": "i am",
    r"\bill\b": "i will",
    r"\bspf\b": "sun protection factor",
    r"\bbc\b": "because",
    r"\bdr\.\b": "doctor",
    r"\bdr\b": "doctor",
    r"\bdef\b": "definitely",
    r"\bptr\b": "peter thomas roth",
    r"\bbha\b": "beta hydroxy acid",
    r"\baha\b": "alpha hydroxy acid",
    r"\bbb\b": "beauty balm",
    r"\bha\b": "hyaluronic acid",
    r"\bomg\b": "oh my god",
    r"\blol\b": "laugh out loud",
    r"\bidk\b": "i do not know",
    r"\bur\b": "your",
    r"\btho\b": "though",
    r"\bfav\b": "favorite",
    r"\bceo\b": "sunday riley",
    r"\bhg\b": "holy grail",
    r"\betc\b": "et cetera",
    r"\bvib\b": "very important beauty insider",
    r"\bmin\b": "minutes",
    r"\bph\b": "potential of hydrogen",
    r"\bmeh\b": "eh",
    r"\bfab\b": "first aid beauty",
    r"\buv\b": "ultraviolet",
    r"\bitd\b": "it would",
    r"\bcuz\b": "because",
    r"\bvit\b": "vitamin",
    r"\b30s\b": "30 seconds",
    r"\b20s\b": "20 seconds",
    r"\b40s\b": "40 seconds",
    r"\b1st\b": "first",
    r"\b2nd\b": "second",
    r"\b3rd\b": "third",
    r"\b4th\b": "fourth",
    r"\b3x\b": "3 times",
    r"\blil\b": "little",
    # added later
    r"\bwouldnt\b": "would not",
    r"\bhavec\b": "have",
    r"\bbomb.com\b": "great",
    r"\byrs\b": "years",
    r"\bfyi\b": "for your information",
    r"\brn\b": "right now"
}

review_df_all_deduplicate['review_text_cleaned'] = (
    review_df_all_deduplicate['review_text_cleaned']
    .replace(correct_map,regex=True)
)
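To confirm that the word-boundary anchors (`\b…\b`) only hit whole tokens, e.g. `ive` but not the `ive` inside `give`, here is a quick spot-check on illustrative strings:

```python
import pandas as pd

sample = pd.Series(["ive used it", "please give it time", "idk tho"])
# Series.replace with regex=True treats dict keys as regex patterns.
out = sample.replace({r"\bive\b": "i have",
                      r"\bidk\b": "i do not know",
                      r"\btho\b": "though"}, regex=True)
print(out.tolist())
# -> ['i have used it', 'please give it time', 'i do not know though']
```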
Code
COMMON_VERBS_AFTER_I = {
    "love","like","think","use","used","feel","felt","have","had","am",
    "would","will","wish","want","tried","try","bought","buy",
    "recommend","ordered","order","see","saw","noticed","notice",
    "find","found","hate","dislike","prefer","need","needed",

    # added later
    "mean", "only", "also", "heard", 
}

def fix_ii_typos(text):

    tokens = text.split()
    new_tokens = []

    for idx, tok in enumerate(tokens):
        lower_tok = tok.lower()

        if lower_tok == "ii":
            prev_tok = tokens[idx-1].lower() if idx > 0 else ""
            next_tok = tokens[idx+1].lower() if idx+1 < len(tokens) else ""

            if prev_tok.isdigit():
                new_tokens.append(tok)

            elif next_tok in COMMON_VERBS_AFTER_I:

                new_tokens.append("i")
            else:

                new_tokens.append(tok)
        else:
            new_tokens.append(tok)

    return " ".join(new_tokens)

mask = review_df_all_deduplicate['review_text_cleaned'].str.contains(r'\bii\b', regex=True)
review_df_all_deduplicate.loc[mask, 'review_text_cleaned'] = review_df_all_deduplicate.loc[mask, 'review_text_cleaned'].apply(fix_ii_typos)

4.6 URL

Clean urls in reviews.

Code
def clean_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+|\bhttps?\b') # enhanced, remove residual 'http'
    
    return url.sub('',text).strip()

review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].map(clean_URL)
review_df_all_deduplicate['review_text_cleaned'] = review_df_all_deduplicate['review_text_cleaned'].str.replace(r"\b[\w.-]+\.com\b", "", regex=True)
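A quick check of the URL pattern on a synthetic review (the text below is illustrative; note that removal leaves internal double spaces, which later whitespace handling absorbs):

```python
import re

url = re.compile(r'https?://\S+|www\.\S+|\bhttps?\b')
text = "bought it at www.sephora.com after reading https://example.com/review"
print(url.sub('', text).strip())
# -> "bought it at  after reading"
```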

4.7 Letter Repeat

Clean words that contain runs of repeated letters (e.g., loooooooove, soooooooo, …):

  1. Reduce any letter repeated three or more times in a row down to two occurrences, to restore near-standard spelling.

  2. Find and analyze the top 500 most frequent words containing repeated characters to identify and handle the remaining irregularities.
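Step 1 can be checked directly with `re.sub` on a few illustrative words. Legitimate double letters (as in "good") are untouched, and elongations collapse to double letters such as `loove`/`soo`, which the step-2 mapping then normalizes:

```python
import re

pattern = r"([A-Za-z])\1{2,}"   # a letter followed by 2+ more copies of itself
for word in ["loooooooove", "soooooooo", "good"]:
    print(re.sub(pattern, r"\1\1", word))
# loooooooove -> loove, soooooooo -> soo, good -> good (unchanged)
```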

Code
# use a compiled re via map to avoid pandas' "match groups" UserWarning from str.contains
repeat_re = re.compile(r"([A-Za-z])\1{2,}")
mask = review_df_all_deduplicate['review_text_cleaned'].map(lambda s: bool(repeat_re.search(s)))

review_df_all_deduplicate.loc[mask, 'review_text_cleaned'] = (
    review_df_all_deduplicate.loc[mask, 'review_text_cleaned']
    .str.replace(r"([A-Za-z])\1{2,}", r"\1\1", regex=True)
)
Code
df_gram_repeat = review_df_all_deduplicate.loc[mask, 'review_text_cleaned'].copy()
df_gram_repeat = pd.DataFrame(df_gram_repeat)
df_gram_repeat['unigram'] = df_gram_repeat['review_text_cleaned'].apply(lambda x: generate_ngrams(x, n_gram=1))
Code
all_tokens = (tok for lst in df_gram_repeat["unigram"] for tok in lst)

short_tokens = [t for t in all_tokens if re.search(r"([A-Za-z])\1", t)] 
short_freq = Counter(short_tokens)

short_freq.most_common(500) # top 500
[('soo', 2280),
 ('really', 1454),
 ('will', 1299),
 ('feel', 1103),
 ('feels', 997),
 ('little', 980),
 ('good', 797),
 ('smells', 552),
 ('feeling', 514),
 ('recommend', 508),
 ('look', 493),
 ('see', 445),
 ('well', 437),
 ('smell', 426),
 ('still', 423),
 ('need', 416),
 ('smooth', 411),
 ('full', 394),
 ('looking', 367),
 ('stuff', 360),
 ('looks', 355),
 ('actually', 333),
 ('week', 325),
 ('better', 319),
 ('lovee', 296),
 ('bottle', 294),
 ('difference', 272),
 ('literally', 264),
 ('pretty', 259),
 ('loove', 257),
 ('apply', 253),
 ('redness', 253),
 ('weeks', 238),
 ('happy', 235),
 ('free', 234),
 ('sunscreen', 232),
 ('looked', 229),
 ('less', 220),
 ('getting', 218),
 ('usually', 217),
 ('small', 209),
 ('keep', 206),
 ('well.', 185),
 ('all.', 170),
 ('different', 163),
 ('especially', 159),
 ('good.', 154),
 ('off.', 152),
 ('took', 152),
 ('tell', 149),
 ('finally', 141),
 ('loong', 135),
 ('overall', 119),
 ('totally', 112),
 ('smooth.', 111),
 ('essence', 110),
 ('wayy', 109),
 ('applying', 108),
 ('waterproof', 104),
 ('green', 102),
 ('good!', 102),
 ('irritate', 99),
 ('better.', 98),
 ('summer', 98),
 ('recommended', 95),
 ('add', 92),
 ('seen', 91),
 ('too.', 90),
 ('applied', 90),
 ('recommend!', 87),
 ('keeps', 83),
 ('deep', 82),
 ('putting', 81),
 ('bottle.', 78),
 ('immediately', 76),
 ('well,', 72),
 ('needed', 72),
 ('off,', 70),
 ('seems', 70),
 ('glass', 70),
 ('gonna', 69),
 ('good,', 69),
 ('normally', 68),
 ('cheeks', 68),
 ('recommend.', 66),
 ('all,', 66),
 ('effective', 65),
 ('added', 64),
 ('irritation', 64),
 ('irritated', 64),
 ('three', 64),
 ('soon', 64),
 ('personally', 64),
 ('waay', 64),
 ('issues', 63),
 ('smaller', 62),
 ('smell.', 61),
 ('difference.', 61),
 ('clogged', 60),
 ('stuff.', 60),
 ('obsessed', 60),
 ('gotten', 59),
 ('cotton', 58),
 ('rubbing', 56),
 ('sunscreen.', 56),
 ('seeing', 56),
 ('follow', 54),
 ('smooth,', 54),
 ('seem', 51),
 ('smell,', 50),
 ('bigger', 49),
 ('too!', 48),
 ('matter', 47),
 ('matte', 46),
 ('well!', 46),
 ('stopped', 46),
 ('suuper', 46),
 ('impressed', 45),
 ('struggle', 45),
 ('yttp', 44),
 ('unless', 44),
 ('stuff!', 43),
 ('supposed', 43),
 ('sunscreen,', 42),
 ('weeks.', 42),
 ('peel', 40),
 ('overall,', 40),
 ('fell', 40),
 ('guess', 40),
 ('redness,', 40),
 ('week.', 40),
 ('seemed', 39),
 ('adding', 39),
 ('micellar', 38),
 ('peeling', 38),
 ('stripped', 38),
 ('youll', 38),
 ('basically', 38),
 ('disappointed', 38),
 ('stripping', 37),
 ('ahh', 37),
 ('immediate', 37),
 ('stubborn', 36),
 ('supple', 36),
 ('irritating', 36),
 ('struggled', 36),
 ('longg', 36),
 ('essential', 36),
 ('sleep', 36),
 ('nighttime', 35),
 ('effect', 35),
 ('appreciate', 34),
 ('redness.', 34),
 ('smoother', 34),
 ('smooth!', 34),
 ('better!', 33),
 ('application', 33),
 ('needs', 33),
 ('dryness', 32),
 ('issue', 32),
 ('followed', 32),
 ('all!', 32),
 ('effects', 32),
 ('cool', 32),
 ('sleeping', 32),
 ('sk-ii', 32),
 ('week,', 31),
 ('dull', 31),
 ('naturally', 30),
 ('itll', 30),
 ('gross', 30),
 ('flawless', 30),
 ('sunscreens', 30),
 ('horrible', 30),
 ('better,', 29),
 ('worried', 28),
 ('smelled', 28),
 ('pass', 28),
 ('excess', 28),
 ('feeling.', 28),
 ('supple.', 28),
 ('stress', 28),
 ('reallyy', 26),
 ('fully', 26),
 ('off!', 26),
 ('biggest', 26),
 ('irritation.', 26),
 ('wanna', 26),
 ('appearance', 26),
 ('keeping', 26),
 ('barrier', 26),
 ('application.', 26),
 ('happen', 25),
 ('sweet', 24),
 ('massage', 24),
 ('suggest', 24),
 ('suffer', 24),
 ('look.', 24),
 ('across', 24),
 ('smoother,', 24),
 ('afford', 24),
 ('typically', 24),
 ('soothing', 23),
 ('annoying', 23),
 ('impressed.', 23),
 ('mirror', 23),
 ('cooling', 23),
 ('squeeze', 23),
 ('effort', 22),
 ('ohh', 22),
 ('buut', 22),
 ('scarring', 22),
 ('bottle,', 22),
 ('weeks,', 22),
 ('difficult', 22),
 ('yall', 22),
 ('sulwhasoo', 22),
 ('penny.', 21),
 ('affordable', 21),
 ('running', 21),
 ('addition', 21),
 ('accutane', 20),
 ('loovvee', 20),
 ('whipped', 20),
 ('supergoop', 20),
 ('free.', 20),
 ('lovee.', 20),
 ('good!!', 20),
 ('effective.', 20),
 ('deeply', 20),
 ('happy.', 20),
 ('fall', 20),
 ('smallest', 20),
 ('cheeks.', 20),
 ('tree', 20),
 ('initially', 20),
 ('generally', 20),
 ('process', 20),
 ('cells', 20),
 ('pills', 20),
 ('balls', 20),
 ('massaging', 20),
 ('specifically', 20),
 ('ball', 20),
 ('terrible', 19),
 ('too,', 19),
 ('reeally', 18),
 ('rubbed', 18),
 ('feel.', 18),
 ('suggested', 18),
 ('refill', 18),
 ('agree', 18),
 ('originally', 18),
 ('allergic', 18),
 ('struggling', 18),
 ('correcting', 18),
 ('bottom', 18),
 ('hopefully', 18),
 ('currently', 18),
 ('amazingg', 18),
 ('reapply', 17),
 ('woww', 17),
 ('happened', 17),
 ('middle', 17),
 ('(especially', 16),
 ('following', 16),
 ('blurry', 16),
 ('worry', 16),
 ('effective,', 16),
 ('obsessed.', 16),
 ('smell!', 16),
 ('smelling', 16),
 ('disappointed.', 16),
 ('superr', 16),
 ('planning', 16),
 ('odd', 16),
 ('cheek', 16),
 ('obsessed!', 16),
 ('needed.', 16),
 ('dryness,', 16),
 ('aand', 16),
 ('essence.', 16),
 ('issues.', 16),
 ('amazingg.', 16),
 ('goop', 16),
 ('mess', 15),
 ('beginning', 15),
 ('applies', 15),
 ('disappoint.', 15),
 ('biossance', 15),
 ('missing', 14),
 ('apply,', 14),
 ('suppose', 14),
 ('scoop', 14),
 ('occasional', 14),
 ('noo', 14),
 ('superfood', 14),
 ('combooily', 14),
 ('peel.', 14),
 ('gotta', 14),
 ('yellow', 14),
 ('umm', 14),
 ('difference,', 14),
 ('adds', 14),
 ('school', 14),
 ('dollars', 14),
 ('dissolve', 14),
 ('immediately.', 14),
 ('pill', 14),
 ('peels', 14),
 ('terrible.', 13),
 ('smoothly', 13),
 ('estee', 13),
 ('impressed!', 13),
 ('hmm', 13),
 ('needing', 13),
 ('deff', 13),
 ('veryy', 12),
 ('recommending', 12),
 ('andd', 12),
 ('effects.', 12),
 ('clogging', 12),
 ('oiliness', 12),
 ('recommend,', 12),
 ('summer.', 12),
 ('omgg', 12),
 ('stressed', 12),
 ('irritated.', 12),
 ('grass', 12),
 ('bottles', 12),
 ('flawless.', 12),
 ('till', 12),
 ('penny', 12),
 ('essence,', 12),
 ('looking.', 12),
 ('sheen', 12),
 ('stuff,', 12),
 ('sunscreen!', 12),
 ('employee', 12),
 ('loove.', 12),
 ('smoothed', 12),
 ('smaller,', 12),
 ('call', 12),
 ('sitting', 12),
 ('loovee', 12),
 ('thankfully', 12),
 ('difference!', 12),
 ('boost', 12),
 ('hooked', 12),
 ('marianna', 12),
 ('stuff!!', 12),
 ('butter', 12),
 ('pulling', 12),
 ('dissolves', 12),
 ('moon', 12),
 ('feeling,', 12),
 ('application,', 12),
 ('looved', 12),
 ('irritates', 12),
 ('comment', 12),
 ('essences', 12),
 ('yess', 12),
 ('telling', 12),
 ('goopglow', 12),
 ('recommend!!', 11),
 ('yall,', 11),
 ('carry', 11),
 ('huuge', 11),
 ('commented', 11),
 ('soothing.', 10),
 ('yess!', 10),
 ('massaged', 10),
 ('happy!', 10),
 ('popped', 10),
 ('lovvee', 10),
 ('bottle!', 10),
 ('disappointing', 10),
 ('brightness', 10),
 ('soothe', 10),
 ('waayy', 10),
 ('really,', 10),
 ('loove!', 10),
 ('carrying', 10),
 ('applying,', 10),
 ('cc', 10),
 ('butt', 10),
 ('setting', 10),
 ('cheeks,', 10),
 ('deeper', 10),
 ('sunscreens,', 10),
 ('loonngg', 10),
 ('soon.', 10),
 ('itt.', 10),
 ('chemically', 10),
 ('foreverr', 10),
 ('gritty', 10),
 ('smoother.', 10),
 ('thicc', 10),
 ('possible', 10),
 ('verry', 10),
 ('itt', 10),
 ('grabbed', 10),
 ('specially', 10),
 ('supple,', 10),
 ('practically', 10),
 ('smaller.', 10),
 ('missed', 10),
 ('lonng', 10),
 ('everr', 10),
 ('different.', 10),
 ('willing', 10),
 ('spoon', 10),
 ('happier', 10),
 ('beautifully', 10),
 ('summer,', 10),
 ('called', 10),
 ('wallet', 10),
 ('ooh', 10),
 ('suffered', 10),
 ('glass.', 10),
 ('dullness', 10),
 ('lovedd', 10),
 ('affordable.', 10),
 ('dramatically', 10),
 ('additional', 10),
 ('immediately!', 10),
 ('eventually', 10),
 ('irritation,', 10),
 ('irritating.', 10),
 ('berry', 10),
 ('supplements', 10),
 ('dropper', 10),
 ('current', 9),
 ('collection', 9),
 ('apply.', 9),
 ('buttery', 9),
 ('pilled', 9),
 ('lott', 9),
 ('effectively', 9),
 ('soo,', 9),
 ('muchh', 9),
 ('drastically', 9),
 ('yess!!!', 8),
 ('ehh', 8),
 ('regardless', 8),
 ('kidding', 8),
 ('penny!', 8),
 ('pillow', 8),
 ('woow', 8),
 ('express', 8),
 ('iffy', 8),
 ('effortlessly', 8),
 ('tugging', 8),
 ('filled', 8),
 ('scrubbing', 8),
 ('pulled', 8),
 ('less.', 8),
 ('happening', 8),
 ('stripping.', 8),
 ('looves', 8),
 ('committing', 8),
 ('stripped.', 8),
 ('tightness', 8),
 ('thrilled', 8),
 ('feels,', 8),
 ('dryness.', 8),
 ('weekly', 8),
 ('teen', 8),
 ('pull', 8),
 ('yall.', 8),
 ('itt!', 8),
 ('patting', 8),
 ('necessary', 8),
 ('disappears', 8),
 ('look,', 8),
 ('sorry', 8),
 ('reaally', 8),
 ('sunny', 8),
 ('little.', 8),
 ('coffee', 8),
 ('free!', 8),
 ('hooked!', 8),
 ('professional', 8),
 ('supply', 8),
 ('amaazing.', 8),
 ('verryy', 8),
 ('scrubbed', 8),
 ('opposed', 8),
 ('soothes', 8),
 ('toothpaste', 8),
 ('falling', 8),
 ('tissue', 8),
 ('hella', 8),
 ('loot', 8),
 ('applied,', 8),
 ('(still', 8),
 ('need.', 8),
 ('mess.', 8),
 ('amazingg!', 8),
 ('excellent', 8)]

Create correction mapping.

Code
correct_map_1 = {
    r"\bsoo\b": "so",
    r"\bsooo\b": "so",
    r"\bsoooo\b": "so",
    r"\bsooooo\b": "so",

    r"\blovee\b": "love",
    r"\bloove\b": "love",
    r"\bloong\b": "long",
    r"\bwayy\b": "way",
    r"\bgonna\b": "going to",
    r"\bwaay\b": "way",
    r"\bsuuper\b": "super",
    r"\byttp\b": "youth to the people",
    r"\byoull\b": "you will",
    r"\blongg\b": "long",
    r"\bnighttime\b": "night time",    
    r"\bitll\b": "it will",
    r"\breallyy\b": "really", 
    r"\bwanna\b": "want to", 
    r"\bohh\b": "oh", 
    r"\bbuut\b": "but", 
    r"\byall\b": "you all",
    r"\bloovvee\b": "love",    
    r"\blovee\.\b": "love",      
    r"\bamazingg\b": "amazing",       
    r"\bwoww\b": "wow",     
    r"\bsuperr\b": "super",
    r"\baand\b": "and",
    r"\bamazingg\.\b": "amazing", 
    r"\bnoo\b": "no", 
    r"\bgotta\b": "got to", 
    r"\bdeff\b": "definitely", 
    r"\bveryy\b": "very", 
    r"\bandd\b": "and",
    r"\bomgg\b": "oh my god",
    r"\bloove\.\b": "love",
    r"\bloovee\b": "love",
    r"\blooved\b": "loved",
    r"\byess\b": "yes",
    r"\byall,\b": "you all",
    r"\bhuuge\b": "huge",
    r"\byess!\b": "yes",
    r"\blovvee\b": "love",
    r"\bwaayy\b": "way",
    r"\bloove!\b": "love",
    r"\bbutt\b": "but",
    r"\bloonngg\b": "long",
    r"\bitt\.\b": "it",
    r"\bforeverr\b": "forever",
    r"\bthicc\b": "thick",
    r"\bverry\b": "very",
    r"\bitt\b": "it",
    r"\blonng\b": "long",
    r"\beverr\b": "ever",
    r"\booh\b": "oh", 
    r"\blovedd\b": "loved",
    r"\blott\b": "lot",
    r"\bsoo,\b": "so",
    r"\bmuchh\b": "much",
    r"\byess!!!\b": "yes",
    r"\bwoow\b": "wow",
    r"\blooves\b": "loves",
    r"\byall\.\b": "you all",
    r"\bitt!\b": "it",
    r"\breaally\b": "really",
    r"\bamaazing\.\b": "amazing", 
    r"\bverryy\b": "very",
    r"\bhella\b": "very",
    r"\bamazingg!\b": "amazing", 
}

review_df_all_deduplicate['review_text_cleaned'] = (
    review_df_all_deduplicate['review_text_cleaned']
    .replace(correct_map_1,regex=True)
)
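As a quick sanity check of the mapping mechanics (a toy example, not the pipeline itself): `Series.replace` with `regex=True` applies each pattern in the map, and the `\b` word boundaries keep partial words such as "soothing" untouched.

```python
import pandas as pd

# Toy correction map with word-boundary anchors, mirroring correct_map_1.
toy_map = {r"\bsoo\b": "so", r"\blovee\b": "love"}

s = pd.Series(["i soo lovee this", "soothing lovely"])
cleaned = s.replace(toy_map, regex=True)
print(cleaned.tolist())  # ['i so love this', 'soothing lovely']
```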

4.8 Punctuation

Remove punctuation.

Code
pattern = "[" + re.escape(string.punctuation) + "]" 
review_df_all_deduplicate["review_text_cleaned"] = (
    review_df_all_deduplicate["review_text_cleaned"].str.replace(pattern, " ", regex=True)
)

punc_lis_remain = []

for text in review_df_all_deduplicate['review_text_cleaned']:
    punc_lis_remain.extend(re.findall(r"[^\w\s]", text))

punc_lis_remain = sorted(set(punc_lis_remain), key=lambda ch: ord(ch)) 
print('Characters (only emoji) left in the reviews: ', '\n', punc_lis_remain)
Characters (only emoji) left in the reviews:  
 ['☝', '⚡', '✅', '✨', '❌', '⭐', '🌺', '🍅', '🍞', '🍯', '🐐', '👌', '👍', '👎', '👏', '💔', '💕', '💗', '💙', '💚', '💜', '💯', '🔥', '😁', '😂', '😃', '😅', '😆', '😉', '😊', '😍', '😏', '😔', '😖', '😞', '😢', '😫', '😬', '😭', '🙁', '🙂', '🙃', '🙌', '🙏', '🤍', '🤓', '🤦', '🤩', '🤯', '🤷', '🥰', '🥶', '🥹', '🥺', '🧼', '🪄', '🫶']
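These emoji survive because `string.punctuation` covers only the 32 ASCII symbols; a quick illustration (toy string, not the review data):

```python
import re
import string

# The same pattern as above: strips ASCII punctuation only.
pattern = "[" + re.escape(string.punctuation) + "]"

sample = "love it!! 😍 10/10."
stripped = re.sub(pattern, " ", sample)

# "!", "/" and "." are gone, but the non-ASCII emoji remains.
print(stripped)
print(re.findall(r"[^\w\s]", stripped))  # ['😍']
```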

4.9 Random Words

Finally, notice the presence of random, meaningless tokens of extreme length, for example:

Code
review_df_all_deduplicate.loc[80989, 'review_text_cleaned']
'yea i agree yes same mhm jtssngdbkgiiycycitctiheckljjioo'

We applied a heuristic threshold of 23 characters: any individual token at or above this length was identified as an anomaly and removed, except for tokens containing meaningful emoji (✅, ❌), which are kept.

Code
def detect_gibberish(w, min_len = 23):
    # flag tokens at or above the 23-character threshold described above
    return len(w) >= min_len

def gibberish_text(text):
    words = text.split()
    return any(detect_gibberish(w) for w in words)


mask = review_df_all_deduplicate['review_text_cleaned'].apply(gibberish_text)

df_process = review_df_all_deduplicate.loc[mask, 'review_text_cleaned']
df_process = pd.DataFrame(df_process)

def drop_long_word(text, threshold = 23):
    emoji = {"✅", "❌"} 
    tokens = text.split()
    kept = []

    for tok in tokens:
        if len(tok) >= threshold:
            if any(e in tok for e in emoji):
                kept.append(tok)

        else:
            kept.append(tok)

    return " ".join(kept)


df_process['review_text_cleaned'] = df_process['review_text_cleaned'].apply(drop_long_word)
review_df_all_deduplicate.loc[df_process.index, 'review_text_cleaned'] = df_process['review_text_cleaned']

This concludes our text cleaning pipeline. While achieving a perfectly clean corpus is impractical given the large volume of unstructured text, this process has substantially reduced noise and standardized the data. Although some residual anomalies may persist, the overall quality of the dataset has been significantly improved for subsequent modeling.

5. Sentence Embedding

We use Sentence-BERT (SBERT) to generate sentence embeddings. Unlike traditional keyword-based methods (TF-IDF) or static word embeddings (fastText), SBERT captures deep contextual meaning. For instance, it understands that “not good” is the opposite of “good,” whereas keyword-based models might treat them similarly. This makes it the most suitable model for precisely converting sentences into vectors given our current computational resources.
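The keyword-overlap limitation is easy to demonstrate. A toy TF-IDF sketch (illustrative only, not part of the SBERT pipeline) scores “good” and “not good” as highly similar simply because they share a token:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["good", "not good", "bad"]
X = TfidfVectorizer().fit_transform(docs)

# "good" vs "not good": opposite sentiment, yet high bag-of-words similarity.
sim = cosine_similarity(X[0], X[1])[0, 0]
print(f"TF-IDF similarity('good', 'not good') = {sim:.2f}")
```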

The implementation details are omitted for brevity. Since SBERT is a transformer-based model that benefits significantly from GPU acceleration, the full execution script is hosted on Google Colab and can be found here.

We directly import the transformed vectors.

Code
df_sbert = pd.read_csv("dataset/sbert_embeddings_1.csv", index_col=0)

6. Clustering

Given the original dataset size of over 240,000 reviews, performing dimensionality reduction (t-SNE) and clustering on the full corpus is computationally prohibitive. Furthermore, visualizing such a high volume of data points on a 2D plane would result in severe overplotting, obscuring any distinct clustering patterns. To address this, we drew a stratified sample (by user rating) from the original data to select a representative subset of around 10,000 reviews for the clustering analysis.

Code
df_sbert['rating'] = review_df_all_deduplicate['rating']
Code
sample_ratio = 0.043

_, df_sbert_sample = train_test_split(df_sbert, test_size=sample_ratio, stratify=df_sbert['rating'], 
                                random_state=42)
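A sketch of why stratification matters here (synthetic labels, not the project data): the sampled subset retains the rating distribution of the full set almost exactly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, skewed "ratings" standing in for the real review stars.
rng = np.random.default_rng(42)
ratings = pd.Series(rng.choice([1, 2, 3, 4, 5], size=10_000,
                               p=[0.05, 0.05, 0.10, 0.20, 0.60]))

# Stratifying on the labels preserves class proportions in the subset.
_, sample = train_test_split(ratings, test_size=0.1,
                             stratify=ratings, random_state=42)

full_props = ratings.value_counts(normalize=True).sort_index()
samp_props = sample.value_counts(normalize=True).sort_index()
print((full_props - samp_props).abs().max())  # near zero
```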

Two clustering algorithms will be performed on the sentence embedding vectors:

  1. K-Means

  2. DBSCAN

To prepare the embeddings for clustering, we apply PCA to reduce the dimensionality of the feature space. Without dimensionality reduction, the distinction between nearest and farthest points diminishes (the curse of dimensionality), making distance-based algorithms like K-Means less effective. By projecting the data onto a lower-dimensional subspace while retaining 90% of the cumulative variance, we ensure that the model captures the dominant semantic structures while reducing noise and improving computational efficiency.

t-SNE is used to visually assess whether the dataset is clusterable. It also receives the PCA-reduced data as input.

Before PCA, L2 normalization was performed to project all data points onto a unit hypersphere. This step is crucial because raw embedding magnitudes can vary. By normalizing, we ensure that the clustering algorithm focuses exclusively on the directional alignment of the vectors (representing semantic meaning) rather than their scale. The centering that PCA requires is then handled automatically during its fit.
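The effect can be sketched on random stand-in vectors (illustrative, not the actual SBERT embeddings): after L2 normalization every row has unit length, and squared Euclidean distance becomes a monotone function of cosine similarity.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))        # stand-in for SBERT embedding rows
emb_n = normalize(emb, norm='l2')

# Every row now sits on the unit hypersphere.
norms = np.linalg.norm(emb_n, axis=1)

# On the unit sphere, ||u - v||^2 = 2 * (1 - cos(u, v)), so Euclidean
# clustering compares direction (semantics), not magnitude.
d2 = np.sum((emb_n[0] - emb_n[1]) ** 2)
cos = emb_n[0] @ emb_n[1]
print(np.allclose(norms, 1.0), np.isclose(d2, 2 * (1 - cos)))
```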

6.1 PCA + K-Means

PCA

Code
mat = df_sbert_sample.iloc[:,:-1].to_numpy()

mat_normalized = normalize(mat, norm='l2')
pca = PCA(n_components=200, random_state= 42).fit(mat_normalized)
cumvar = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(6,4))
plt.plot(range(1, len(cumvar)+1), cumvar, marker="o")
plt.axhline(0.9, color="r", ls="--", lw=1) 
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.ylim(0, 1.01)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('principal component number containing 90% variance: ',np.argmin(cumvar <= 0.9) + 1)

principal component number containing 90% variance:  147
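The component count above uses an `np.argmin` idiom; a small check on a toy cumulative-variance array (illustrative values) confirms it matches the more explicit `np.searchsorted` form:

```python
import numpy as np

# Toy cumulative-variance curve crossing 0.9 at the 4th component.
cumvar = np.array([0.40, 0.70, 0.85, 0.92, 0.97, 1.00])

# argmin over the boolean array returns the index of the first False,
# i.e. the first component whose cumulative variance exceeds the threshold.
k_argmin = np.argmin(cumvar <= 0.9) + 1

# np.searchsorted gives the same answer and reads more directly.
k_search = np.searchsorted(cumvar, 0.9) + 1
print(k_argmin, k_search)  # 4 4
```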
Code
pca = PCA(n_components=147, random_state= 42)
mat_pca = pca.fit_transform(mat_normalized)

TSNE

Code
%%time
tsne_sample = TSNE(n_components=2, perplexity=50,random_state=42, n_jobs = -1).fit(mat_pca)
CPU times: total: 13min 29s
Wall time: 1min 2s

Plot t-SNE with different perplexity values (5, 10, 20, 30, 40, 50) and random states (28, 42, 68).

Code
fig, axes = plt.subplots(6,3, figsize = (14, 24))

perpre_lis = [5, 10, 20, 30, 40, 50]
seed_lis = [28, 42, 68]

for i, per_val in enumerate(perpre_lis):
    for j, seed in enumerate(seed_lis):
        # same fit-and-plot routine for every (perplexity, seed) pair
        tsne = TSNE(n_components=2, perplexity=per_val, random_state=seed, n_jobs=-1).fit(mat_pca)
        df_tsne = pd.DataFrame(tsne.embedding_)

        sns.scatterplot(x=df_tsne.iloc[:, 0], y=df_tsne.iloc[:, 1], ax=axes[i, j])

        axes[i, j].set_title(f"Perplexity Value {per_val} and Random State {seed}")
        axes[i, j].set_xlabel("x_projected")
        axes[i, j].set_ylabel("y_projected")

plt.tight_layout()
plt.show()

Pick a representative one.

Code
tsne_sample = TSNE(n_components=2, perplexity=40,random_state=42, n_jobs = -1).fit(mat_pca)
mat_tsne_sample = tsne_sample.embedding_
df_tsne_sample = pd.DataFrame(mat_tsne_sample)

fig, ax = plt.subplots()
sns.scatterplot(x=df_tsne_sample.iloc[:,0], y=df_tsne_sample.iloc[:,1])
ax.set_title(f"Perplexity Value {40} and Random State {42}")
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
plt.show()

We see one large main cluster and several smaller clusters that are not well separated. However, the isolated island clusters and the variable density within the main mass strongly suggest that the dataset contains inherent groupings suitable for clustering algorithms like K-Means and DBSCAN.

Code
_, sample_idx = train_test_split(df_sbert.index, test_size=sample_ratio, stratify=df_sbert['rating'], 
                                random_state=42)

Let’s try color-coding each review by its rating.

Code
tsne_sample = TSNE(n_components=2, perplexity=40,random_state=42, n_jobs = -1).fit(mat_pca)
mat_tsne_sample = tsne_sample.embedding_
df_tsne_sample = pd.DataFrame(mat_tsne_sample)
df_tsne_sample['rating'] = review_df_all_deduplicate.loc[sample_idx,'rating'].reset_index()['rating']


fig, ax = plt.subplots()
sns.scatterplot(x=df_tsne_sample.iloc[:,0], y=df_tsne_sample.iloc[:,1], hue=df_tsne_sample['rating'])
ax.set_title(f"Perplexity Value {40} and Random State {42}")
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
plt.show()

No clear pattern emerges.

6.2 Clustering analysis

Elbow Plot

Code
inertia_lis = []
for i in range(2, 14):
    
    kmeans = KMeans(n_clusters= i, random_state=42)
    kmeans.fit(mat_pca)
        
    inertia_lis.append(kmeans.inertia_)


plt.plot(range(2,14), inertia_lis)
plt.xlabel('Number of Clusters Requested in K-means')
plt.ylabel('Inertia')
plt.xticks(range(2,14))
plt.title('Elbow Plot')
plt.show()

No clear elbow is displayed; possible optimal numbers of clusters are 6 and 9.

Average Silhouette Score

Code
from sklearn.metrics import adjusted_rand_score, silhouette_samples, silhouette_score
cluster_num_list = range(2, 14)

avg_sil_score_list = []
for k in cluster_num_list:

    kmeans=KMeans(n_clusters=k, random_state=42).fit(mat_pca)
        
    cluster_labels = kmeans.labels_
        
    avg_sil_score=silhouette_score(mat_pca, cluster_labels)
        
    avg_sil_score_list.append(avg_sil_score)

plt.plot(cluster_num_list,avg_sil_score_list)
plt.xlabel('Number of Clusters Requested in K-means')
plt.ylabel('Average Silhouette Score of the K-Means Clusterings')
plt.title('Average Silhouette Score Plot')
plt.show()

Notice that the overall average silhouette scores are very low. The silhouette coefficient assumes that clusters are convex and well separated; however, high-dimensional text embeddings (even after dimensionality reduction) generated by SBERT often form non-convex, density-based structures with fluid semantic boundaries, naturally resulting in lower silhouette scores than typical tabular data.
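For contrast, a synthetic sketch (scikit-learn toy blobs, unrelated to the review embeddings) shows the scores the silhouette coefficient produces when its convexity assumption actually holds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Convex, well-separated Gaussian blobs: the geometry the silhouette
# coefficient implicitly assumes.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

score = silhouette_score(X, labels)
print(round(score, 2))
```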

Cluster sorted matrix

Code
cluster_num_list=range(2,14)

for k in cluster_num_list:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(mat_pca)
    
    a = df_sbert_sample.iloc[:, :-1].copy() 
    a['k_mean_label'] = kmeans.labels_
    a_sorted = a.sort_values(by=["k_mean_label"])
    
    print(f'--- K={k} Cluster Distribution ---')
    print(a_sorted['k_mean_label'].value_counts())
    
    data_for_dist = a_sorted.drop(['k_mean_label'], axis=1)
    dist_mat = squareform(pdist(data_for_dist, metric='euclidean'))
    
    fig, ax = plt.subplots(figsize=(10, 8))
    
    c = ax.imshow(dist_mat, cmap='viridis', aspect='auto', origin='lower') 
    
    ax.set_title(f"Distance Matrix Sorted by Cluster (K={k})")
    ax.set_xlabel('Data Points (Sorted)')
    ax.set_ylabel('Data Points (Sorted)')
    fig.colorbar(c, ax=ax, shrink=0.8, label='Euclidean Distance')
    
    plt.show()
    
    plt.close(fig) 

    del dist_mat
    del a_sorted
    del data_for_dist
--- K=2 Cluster Distribution ---
k_mean_label
1    7395
0    2983
Name: count, dtype: int64

--- K=3 Cluster Distribution ---
k_mean_label
1    4147
0    3657
2    2574
Name: count, dtype: int64

--- K=4 Cluster Distribution ---
k_mean_label
1    3818
3    3477
0    1990
2    1093
Name: count, dtype: int64

--- K=5 Cluster Distribution ---
k_mean_label
1    3640
3    3286
0    1942
2    1087
4     423
Name: count, dtype: int64

--- K=6 Cluster Distribution ---
k_mean_label
1    3090
0    2411
5    1801
3    1640
2    1036
4     400
Name: count, dtype: int64

--- K=7 Cluster Distribution ---
k_mean_label
6    2847
0    2294
5    1734
3    1086
1    1025
2     995
4     397
Name: count, dtype: int64

--- K=8 Cluster Distribution ---
k_mean_label
5    2069
7    1745
0    1716
4    1622
3    1308
2     986
1     537
6     395
Name: count, dtype: int64

--- K=9 Cluster Distribution ---
k_mean_label
7    2507
5    2168
4    1297
1     999
3     995
2     945
0     549
8     517
6     401
Name: count, dtype: int64

--- K=10 Cluster Distribution ---
k_mean_label
5    1701
4    1534
9    1497
7    1309
1    1023
2     967
3     864
0     662
8     441
6     380
Name: count, dtype: int64

--- K=11 Cluster Distribution ---
k_mean_label
5     1708
9     1442
7     1315
0     1195
3      936
2      915
10     859
1      655
4      536
8      439
6      378
Name: count, dtype: int64

--- K=12 Cluster Distribution ---
k_mean_label
5     1594
9     1441
7     1409
0     1186
2      913
10     858
1      648
3      561
4      528
8      439
11     423
6      378
Name: count, dtype: int64

--- K=13 Cluster Distribution ---
k_mean_label
5     1264
12    1166
0     1156
7     1086
3     1037
2      904
10     781
9      623
1      580
4      525
8      444
11     432
6      380
Name: count, dtype: int64

As the number of clusters \(k\) increases, a distinct block-diagonal structure emerges. The diagonal regions gradually become darker, indicating improved intra-cluster cohesion, while the off-diagonal areas become lighter, signaling better inter-cluster separation. However, if we keep increasing \(k\), we observe a degradation in separation, where clusters begin to exhibit high similarity to one another again.

Based on the three clustering diagnostics presented above, we choose \(k = 9\).

6.3 Re-run K-Means

Code
kmeans = KMeans(n_clusters= 9, random_state=42)
kmeans.fit(mat_pca)
KMeans(n_clusters=9, random_state=42)

Color code by predicted k-means labels.

Code
df_tsne_sample['k_mean_label'] = kmeans.labels_


fig, ax = plt.subplots()
sns.scatterplot(x=df_tsne_sample.iloc[:,0], y=df_tsne_sample.iloc[:,1], hue=df_tsne_sample['k_mean_label'])
ax.set_title(f"Perplexity Value {40} and Random State {42}")
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
plt.show()

We see the clustering result is not perfect, as some isolated islands contain multiple K-Means labels. However, the algorithm splits the larger cluster into multiple adjacent sub-clusters, which indicates that it is capturing granular relationships. Overall, the clustering result is reasonably sound.

6.4 PCA + DBSCAN

DBSCAN operates on the same PCA-reduced matrix as input.

Plot the k-distance curve for k = 2 to 10:

Code
#Find the distance of the kth nearest neighbor
for k in range(2,11):
    nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(mat_pca)
    distances, indices = nbrs.kneighbors(mat_pca)

    plt.plot(np.sort(distances.T[k-1]))
    plt.xlabel('Points Sorted by Distance to k=%s Nearest Neighbor'%k)
    plt.ylabel('k=%s Nearest Neighbor Distance'%k)
    plt.title('Noisy Dataset')
    plt.show()

Based on the k-distance plot with \(k=10\), we identified the most distinct elbow point around a distance of 0.78. The region below this threshold represents the dense clusters, while the sharp rise beyond it indicates outliers. Therefore, we selected eps=0.78 and min_pts=10 as our baseline parameters for DBSCAN to balance cluster granularity and noise reduction.
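The elbow can also be located programmatically. One simple heuristic (sketched here on synthetic distances; real k-distance curves are noisier and may need smoothing first) picks the index of the largest discrete second difference:

```python
import numpy as np

# Synthetic sorted k-distance curve: a flat dense region followed by a
# sharp rise, mimicking the shape of the plots above.
dense = np.linspace(0.2, 0.8, 900)
rise = np.linspace(0.8, 3.0, 101)[1:]   # skip the duplicated 0.8
d = np.concatenate([dense, rise])

# The knee sits where the slope changes fastest, i.e. where the
# discrete second difference peaks.
knee_idx = np.argmax(np.diff(d, n=2)) + 1
print(round(d[knee_idx], 2))  # near the 0.8 transition
```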

We set the parameter range to \(\epsilon \in\) [0.5, 0.95] with a step of 0.05 and min_pts \(\in\) [5, 10, 15, 20] to perform a small grid search for the best parameters.

Code
X_input = mat_pca 

eps_range = np.arange(0.5, 0.95, 0.05) 

min_samples_list = [5, 10, 15, 20] 

for minpts in min_samples_list:
    silhouette_scores = []
    num_clusters_list = []
    num_noise_points = []
    valid_eps = [] 

    print(f"--- Processing min_samples = {minpts} ---")

    for eps in eps_range:
        dst = DBSCAN(eps=eps, min_samples=minpts, metric='euclidean', n_jobs=-1)
        labels = dst.fit_predict(X_input)

        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        
        num_clusters_list.append(n_clusters)
        num_noise_points.append(n_noise)
        valid_eps.append(eps)
        
        if n_clusters > 1 and n_noise < len(X_input):
            non_noise_mask = labels != -1
            
            score = silhouette_score(X_input[non_noise_mask], labels[non_noise_mask])
            silhouette_scores.append(score)
        else:
            silhouette_scores.append(0)
        
    fig, ax = plt.subplots(1, 3, figsize=(18, 5), constrained_layout=True)
    fig.suptitle(f'DBSCAN Parameter Search (min_samples={minpts})', fontsize=16)

    ax[0].plot(valid_eps, silhouette_scores, marker='o', color='tab:blue')
    ax[0].set_title('Avg Silhouette Score')
    ax[0].set_xlabel('Epsilon')
    ax[0].set_ylabel('Score')
    ax[0].grid(True, linestyle='--', alpha=0.6)

    ax[1].plot(valid_eps, num_noise_points, marker='o', color='tab:red')
    ax[1].set_title('Number of Noise Points')
    ax[1].set_xlabel('Epsilon')
    ax[1].set_ylabel('Count')
    ax[1].grid(True, linestyle='--', alpha=0.6)

    ax[2].plot(valid_eps, num_clusters_list, marker='o', color='tab:green')
    ax[2].set_title('Number of Clusters')
    ax[2].set_xlabel('Epsilon')
    ax[2].set_ylabel('Count')
    ax[2].grid(True, linestyle='--', alpha=0.6)

    plt.show()
--- Processing min_samples = 5 ---

--- Processing min_samples = 10 ---

--- Processing min_samples = 15 ---

--- Processing min_samples = 20 ---

While the grid search showed the highest silhouette scores at \(\epsilon \ge 0.70\), these settings detected too little noise and merged meaningful clusters into one or two large groups. We identified the optimal operating point at \(\epsilon = 0.65\) with min_samples=5.

6.5 Re-run DBSCAN + Clustering Analysis

Code
BEST_EPS = 0.65        
BEST_MIN_SAMPLES = 5   
X_INPUT = mat_pca

db = DBSCAN(eps=BEST_EPS, min_samples=BEST_MIN_SAMPLES, metric='euclidean', n_jobs=-1)
labels = db.fit_predict(X_INPUT)

a = df_sbert_sample.iloc[:, :-1].copy() 
a['db_label'] = labels
a_sorted = a.sort_values(by=["db_label"])

print(a_sorted['db_label'].value_counts())

data_for_dist = a_sorted.drop(['db_label'], axis=1)
dist_mat = squareform(pdist(data_for_dist, metric='euclidean'))


fig, ax = plt.subplots(figsize=(12, 10))

im = ax.imshow(dist_mat, cmap='viridis', aspect='auto', origin='lower')

cbar = plt.colorbar(im, ax=ax, shrink=0.8)
cbar.set_label('Euclidean Distance')


ax.set_title(f"DBSCAN Sorted Distance Matrix\n(eps={BEST_EPS}, min_samples={BEST_MIN_SAMPLES})")
ax.set_xlabel('Data Points (Sorted by Cluster Label)')
ax.set_ylabel('Data Points (Sorted by Cluster Label)')

plt.tight_layout()
plt.show()

del dist_mat
db_label
 0    7547
-1    2777
 4      26
 1      10
 2       7
 3       7
 5       4
Name: count, dtype: int64

Code
df_tsne_sample['db_label'] = labels


fig, ax = plt.subplots()
sns.scatterplot(x=df_tsne_sample.iloc[:,0], y=df_tsne_sample.iloc[:,1], hue=df_tsne_sample['db_label'])
ax.set_title(f"Perplexity Value {40} and Random State {42}")
ax.set_xlabel("x_projected")
ax.set_ylabel("y_projected")
plt.show()

However, visually checking the cluster-sorted distance matrix and the label-colored t-SNE plot shows that DBSCAN performed clearly worse than K-Means: the clusters appeared less distinct and contained misclassified noise. Consequently, we choose K-Means with \(k=9\) as our optimal clustering solution.

7. Interpret Results

Code
kmeans = KMeans(n_clusters= 9, random_state=42)
kmeans.fit(mat_pca)
KMeans(n_clusters=9, random_state=42)
Code
df_sample = review_df_all_deduplicate.loc[sample_idx].copy()  # copy to avoid SettingWithCopyWarning
df_sample['predicted_label'] = kmeans.labels_

Plot all the numerical variables (except those dominated by missing values) across clusters.

Code
num_col = ['rating', 'total_feedback_count', 'total_neg_feedback_count', 'total_pos_feedback_count']

for num in num_col:
    plt.figure(figsize=(6,4))
    sns.boxplot(data=df_sample, x='predicted_label', y=num)
    plt.xlabel('Predicted cluster')
    plt.ylabel(num)
    plt.title(f'{num} by cluster')
    plt.tight_layout()
    plt.show()

Code
print(df_sample.groupby('predicted_label').agg({'rating': 'mean', 'review_text': 'count'    
}).sort_values(by='rating'))
                   rating  review_text
predicted_label                       
1                3.994995          999
3                4.013065          995
7                4.183486         2507
6                4.359102          401
8                4.361702          517
4                4.387047         1297
5                4.399446         2168
2                4.400000          945
0                4.520947          549

The table summarizes the statistical profile of the 9 identified clusters, sorted by average rating. We observe a narrow rating distribution ranging from 3.99 to 4.52, indicating an overall positive sentiment across the sampled dataset. Cluster 7 is the dominant group by volume, while Cluster 1 and Cluster 0 represent the lowest and highest satisfaction extremes, respectively.
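The bigram counting below relies on a `generate_ngrams` helper defined earlier in the notebook (outside this excerpt). A minimal sketch consistent with how it is called here (an assumption, not the author's exact code) would be:

```python
from collections import Counter

def generate_ngrams(text, n_gram=2):
    # Hypothetical reimplementation: slide a window of n_gram whitespace
    # tokens over the text and join each window with a space.
    tokens = text.split()
    return [" ".join(tokens[i:i + n_gram]) for i in range(len(tokens) - n_gram + 1)]

bigrams = generate_ngrams("dry skin feels great", n_gram=2)
print(Counter(bigrams).most_common(2))
```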

Code
sorted_clusters = sorted(df_sample['predicted_label'].unique())


print("-" * 60)

for cluster_id in sorted_clusters:

    cluster_data = df_sample[df_sample['predicted_label'] == cluster_id]
    
    combined_text = " ".join(cluster_data['review_text_cleaned'].astype(str).tolist())

    ngrams_list = generate_ngrams(combined_text, n_gram=2)
    
    top_keywords = Counter(ngrams_list).most_common(10)
    
    print(f"Cluster {cluster_id} (Size: {len(cluster_data)})")
    print(f"Top Keywords: {top_keywords}")
    print("-" * 60)
------------------------------------------------------------
Cluster 0 (Size: 549)
Top Keywords: [('dermalogica sampling', 91), ('gifted dermalogica', 78), ('dead skin', 73), ('sensitive skin', 72), ('daily microfoliant', 59), ('skin feeling', 52), ('complimentary dermalogica', 52), ('leaves skin', 49), ('skin feels', 49), ('dry skin', 44)]
------------------------------------------------------------
Cluster 1 (Size: 999)
Top Keywords: [('full size', 68), ('love product', 51), ('received product', 40), ('received sample', 35), ('sample size', 33), ('skin feel', 31), ('goes long', 29), ('long way', 29), ('highly recommend', 27), ('product free', 27)]
------------------------------------------------------------
Cluster 2 (Size: 945)
Top Keywords: [('makeup remover', 156), ('eye makeup', 129), ('remove makeup', 94), ('removes makeup', 78), ('waterproof mascara', 77), ('cleansing balm', 76), ('leaves skin', 69), ('skin feeling', 62), ('removing makeup', 60), ('love product', 58)]
------------------------------------------------------------
Cluster 3 (Size: 995)
Top Keywords: [('received product', 69), ('leaves skin', 60), ('sensitive skin', 59), ('skin feel', 56), ('long way', 55), ('goes long', 53), ('love product', 48), ('dry skin', 47), ('makes skin', 47), ('skin feeling', 46)]
------------------------------------------------------------
Cluster 4 (Size: 1297)
Top Keywords: [('dry skin', 137), ('sensitive skin', 126), ('leaves skin', 115), ('face wash', 114), ('skin feeling', 102), ('cleansing balm', 100), ('love cleanser', 97), ('skin feels', 91), ('oily skin', 81), ('acne prone', 64)]
------------------------------------------------------------
Cluster 5 (Size: 2168)
Top Keywords: [('dry skin', 374), ('skin feels', 171), ('sensitive skin', 166), ('skin feel', 151), ('oily skin', 143), ('makes skin', 136), ('long way', 122), ('goes long', 118), ('received product', 112), ('skin feeling', 109)]
------------------------------------------------------------
Cluster 6 (Size: 401)
Top Keywords: [('dry skin', 48), ('jet lag', 47), ('summer fridays', 42), ('love mask', 42), ('lag mask', 39), ('sensitive skin', 36), ('overnight mask', 27), ('face mask', 23), ('makes skin', 19), ('next morning', 18)]
------------------------------------------------------------
Cluster 7 (Size: 2507)
Top Keywords: [('sensitive skin', 269), ('acne prone', 162), ('love product', 160), ('using product', 152), ('prone skin', 132), ('dry skin', 119), ('difference skin', 117), ('skin tone', 116), ('skin feels', 113), ('made skin', 111)]
------------------------------------------------------------
Cluster 8 (Size: 517)
Top Keywords: [('sun protection', 154), ('protection factor', 136), ('white cast', 66), ('love sunscreen', 39), ('best sunscreen', 32), ('oily skin', 24), ('sunscreen used', 24), ('leave white', 23), ('doesnt leave', 23), ('unseen sunscreen', 23)]
------------------------------------------------------------

The largest portion of the dataset is driven by efficacy regarding specific skin types. Cluster 7 (the largest group) and Cluster 4 focus heavily on problem-solving for “acne prone” and “sensitive skin,” indicating that efficacy is the primary driver of engagement. Meanwhile, Cluster 5 captures the positive experiences of “dry skin” users seeking hydration.

The model isolated distinct product lines. Cluster 8 identifies “sun protection” with a specific negative sentiment regarding “white cast,” highlighting a critical product defect. Cluster 6 is unique as it is dominated by a single viral product (“Summer Fridays Jet Lag Mask”), demonstrating how specific ‘hero products’ can form their own semantic clusters.

Clusters 0, 1, and 3 are the non-product-related clusters. Cluster 0 is characterized by terms like “gifted” and “dermalogica sampling,” representing incentivized reviews that may introduce positive bias. Conversely, Cluster 1 focuses on “sample size” complaints.

Visualization of top 50 bigrams for each cluster.

Code
n_clusters = 9
n_cols = 3
n_rows = math.ceil(n_clusters / n_cols)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 15))
axes = axes.flatten() 

sorted_clusters = sorted(df_sample['predicted_label'].unique())


for i, cluster_id in enumerate(sorted_clusters):
    ax = axes[i]
    
    cluster_data = df_sample[df_sample['predicted_label'] == cluster_id]
    
    combined_text = " ".join(cluster_data['review_text_cleaned'].astype(str).tolist())
    
    ngrams_list = generate_ngrams(combined_text, n_gram=2)
    
    ngram_counts = dict(Counter(ngrams_list).most_common(50))
    
    if len(ngram_counts) > 0:
        wc = WordCloud(
            width=800, 
            height=400, 
            background_color='white', 
            colormap='viridis', 
            max_font_size=100
        ).generate_from_frequencies(ngram_counts)
        
        ax.imshow(wc, interpolation="bilinear")
        ax.set_title(f"Cluster {cluster_id} Theme\n(Size: {len(cluster_data)})", fontsize=14)
    else:
        ax.text(0.5, 0.5, "Not enough data", ha='center', va='center')
        
    ax.axis('off') 

for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()

8. Conclusion

To conclude, by moving beyond static star ratings with unsupervised learning, we transformed a sample of unstructured customer feedback into 9 distinct semantic clusters. We experimented with two algorithms, K-Means and DBSCAN, and found that K-Means was the better fit for our case. We then interpreted the clustering results in terms of their business value, classifying the 9 clusters into 3 broad categories: skin-type topics, product-specific topics, and non-product-related noise.

Next Steps

In this project, the entire clustering pipeline was built on only around 10,000 sampled reviews, so the result may not be representative of the full review corpus. We also recognize limitations of the methods used, such as the computational cost of t-SNE and the imperfect cluster-shape assumptions of both algorithms; a more robust approach is needed.